
Text Recognition: Foreign Literature Translation Material

 2022-08-11 02:08  

FOTS: Fast Oriented Text Spotting with a Unified Network

Xuebo Liu1, Ding Liang1, Shi Yan1, Dagui Chen1, Yu Qiao2, and Junjie Yan1

1SenseTime Group Ltd.

2Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

{liuxuebo,liangding,yanshi,chendagui,yanjunjie}@sensetime.com, {yu.qiao}@siat.ac.cn

Abstract

Incidental scene text spotting is considered one of the most difficult and valuable challenges in the document analysis community. Most existing methods treat text detection and recognition as separate tasks. In this work, we propose a unified, end-to-end trainable Fast Oriented Text Spotting (FOTS) network for simultaneous detection and recognition, sharing computation and visual information between the two complementary tasks. Specifically, RoIRotate is introduced to share convolutional features between detection and recognition. Benefiting from the convolution sharing strategy, our FOTS has little computation overhead compared to a baseline text detection network, and the joint training method learns more generic features that make our method perform better than two-stage methods. Experiments on the ICDAR 2015, ICDAR 2017 MLT, and ICDAR 2013 datasets demonstrate that the proposed method significantly outperforms state-of-the-art methods, which further allows us to develop the first real-time oriented text spotting system, surpassing all previous state-of-the-art results by more than 5% on the ICDAR 2015 text spotting task while keeping 22.6 fps.

1. Introduction

Reading text in natural images has attracted increasing attention in the computer vision community [49, 43, 53, 44, 14, 15, 34], due to its numerous practical applications in document analysis, scene understanding, robot navigation, and image retrieval. Although previous works have made significant progress in both text detection and text recognition, the task remains challenging due to the large variance of text patterns and highly complicated backgrounds.

The most common approach to scene text reading is to divide it into text detection and text recognition, which are handled as two separate tasks [20, 34]. Deep learning based approaches have become dominant in both parts. In text detection, a convolutional neural network is usually used to extract feature maps from a scene image, and then different decoders are used to decode the text regions [49, 43, 53].

[Figure 1 image: a sample scene with spotted text ("GIORDAN", "GAP", "50", "OFF") and per-pipeline timings: FOTS 44.2 ms; Detection 41.7 ms; Recognition 42.5 ms.]

Figure 1: Different from previous two-stage methods, FOTS solves the oriented text spotting problem directly and efficiently. FOTS can detect and recognize text simultaneously with little computation cost compared to a single text detection network (44.2 ms vs. 41.7 ms), and is almost twice as fast as the two-stage method (44.2 ms vs. 84.2 ms). This is detailed in Sec. 4.4.

In text recognition, on the other hand, a network for sequential prediction is run on top of the text regions one by one [44, 14]. This leads to heavy time cost, especially for images with a large number of text regions. Another problem is that it ignores the correlation in visual cues shared between detection and recognition: a single detection network cannot be supervised by labels from text recognition, and vice versa.

In this paper, we propose to consider text detection and recognition simultaneously. This leads to the fast oriented text spotting system (FOTS), which can be trained end-to-end. In contrast to previous two-stage text spotting, our method learns more generic features through a convolutional neural network, which are shared between text detection and text recognition, and the supervision from the two tasks is complementary. Since feature extraction usually takes most of the time, this shrinks the computation to that of a single detection network, as shown in Fig. 1. The key to connecting detection and recognition is RoIRotate, which extracts proper features from the feature maps according to the oriented detection bounding boxes.
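To make the RoIRotate idea concrete, the following is a minimal sketch (our illustration, not the authors' released code) of how an oriented region can be resampled into an axis-aligned, fixed-height feature patch using PyTorch's `affine_grid`/`grid_sample`; the `theta` parametrization and all names here are assumptions.

```python
import torch
import torch.nn.functional as F

def roi_rotate(feature_map, theta, out_h=8, out_w=64):
    """Crop one oriented region from a feature map via affine sampling.

    feature_map: (1, C, H, W) shared feature map.
    theta:       (1, 2, 3) affine matrix mapping output coordinates to
                 input coordinates in normalized [-1, 1] space (assumed
                 to be built from the box center, scale, and angle).
    Returns a (1, C, out_h, out_w) axis-aligned feature patch.
    """
    # Build a sampling grid for the target patch, then bilinearly sample
    # the shared features at those locations.
    grid = F.affine_grid(theta, size=(1, feature_map.size(1), out_h, out_w),
                         align_corners=False)
    return F.grid_sample(feature_map, grid, align_corners=False)

# Toy usage: an identity transform simply resizes the whole map.
fmap = torch.randn(1, 32, 40, 160)
theta = torch.tensor([[[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]]])
patch = roi_rotate(fmap, theta)
print(patch.shape)  # torch.Size([1, 32, 8, 64])
```

Because both operations are differentiable, losses from the recognition branch can back-propagate through the sampled patches into the shared features, which is the property joint training relies on.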


[Figure 2 image: shared convolutions produce shared features; the text detection branch predicts oriented bounding boxes; RoIRotate extracts text proposal features from the shared features, and the text recognition branch outputs the predicted texts.]

Figure 2: Overall architecture. The network predicts both text regions and text labels in a single forward pass.

The architecture is presented in Fig. 2. Feature maps are first extracted with shared convolutions. The fully convolutional network based oriented text detection branch is built on top of the feature maps to predict the detection bounding boxes. The RoIRotate operator then extracts text proposal features corresponding to the detection results from the feature map. The text proposal features are fed into a Recurrent Neural Network (RNN) encoder and a Connectionist Temporal Classification (CTC) decoder [9] for text recognition. Since all the modules in the network are differentiable, the whole system can be trained end-to-end. To the best of our knowledge, this is the first end-to-end trainable framework for oriented text detection and recognition. We find that the network can be easily trained without complicated post-processing or hyper-parameter tuning.
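As a rough structural sketch of this single forward pass (our skeleton, not the paper's implementation), the pieces can be wired as below, reusing the `roi_rotate` helper sketched earlier; `backbone`, `detect_head`, and `recog_encoder` are hypothetical stand-ins for the components named above.

```python
import torch.nn as nn

class FOTSSketch(nn.Module):
    """Detection + recognition on shared features in one forward pass.
    All sub-modules are injected, hypothetical stand-ins."""

    def __init__(self, backbone, detect_head, recog_encoder, num_classes):
        super().__init__()
        self.backbone = backbone            # shared convolutions
        self.detect_head = detect_head      # score map + oriented geometry
        self.recog_encoder = recog_encoder  # sequence encoder over a patch
        # Per-time-step classifier producing CTC logits (blank + chars).
        self.classifier = nn.Linear(recog_encoder.hidden_size, num_classes)

    def forward(self, images, rois):
        """rois: list of (image_index, theta) pairs, one per text region,
        where theta is the affine matrix consumed by roi_rotate."""
        shared = self.backbone(images)              # e.g. (N, C, H/4, W/4)
        score, geometry = self.detect_head(shared)  # detection outputs
        logits = []
        for i, theta in rois:
            patch = roi_rotate(shared[i:i + 1], theta)  # (1, C, h, w)
            seq = self.recog_encoder(patch)             # (T, hidden)
            logits.append(self.classifier(seq))         # (T, num_classes)
        return score, geometry, logits
```

Training would then combine a detection loss on `score`/`geometry` with `nn.CTCLoss` on the per-region logits; commonly, ground-truth regions rather than predicted ones feed the recognition branch during training, a detail omitted here.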

The contributions are summarized as follows:

  • We propose an end-to-end trainable framework for fast oriented text spotting. By sharing convolutional features, the network can detect and recognize text simultaneously with little computation overhead, which leads to real-time speed.
  • We introduce RoIRotate, a new differentiable operation that extracts oriented text regions from convolutional feature maps. This operation unifies text detection and recognition into an end-to-end pipeline.
  • FOTS significantly outperforms state-of-the-art methods on a number of text detection and text spotting benchmarks, including ICDAR 2015 [26], ICDAR 2017 MLT [1], and ICDAR 2013 [27].


2. Related Work

Text recognition is an active topic in computer vision and document analysis. In this section, we briefly introduce related work, covering text detection and text recognition methods.

Text Detection

Most conventional text detection methods regard text as a composition of characters. These character-based methods first localize characters in an image and then group them into words or text lines. Sliding-window based methods [22, 28, 3, 54] and connected-component based methods [18, 40, 2] are two representative categories.

In recent years, many deep learning based methods have been proposed to directly detect words in images. Tian et al. [49] adopt a vertical anchor mechanism to predict fixed-width proposals and then connect them. Ma et al. [39] propose a rotation-based framework for arbitrarily oriented text by introducing Rotation RPN and Rotation RoI pooling. Shi et al. [43] first predict text segments and then link them into complete instances using linkage prediction. Based on dense prediction and one-step post-processing, Zhou et al. [53] and He et al. [15] propose deep direct regression methods for multi-oriented scene text detection.

Text Recognition

In general, scene text recognition aims to decode a sequence of labels from regularly cropped but variable-length text images. Most previous methods [8, 30] capture individual characters and then refine misclassified ones. Apart from character-level methods, recent text region recognition approaches can be divided into three categories: word classification based, sequence-to-label decoding based, and sequence-to-sequence model based methods.

Jaderberg et al. [19] pose word recognition as a conventional multi-class classification task with a large number of class labels. Su et al. [48] frame text recognition as a sequence labeling problem, where an RNN is built on top of HOG features and CTC is adopted as the decoder. Shi et al. [44] and He et al. [14] propose deep recurrent models to encode max-out CNN features and adopt CTC to decode the encoded sequence. Fujii et al. [5] propose an encoder and summarizer network to perform line-level script identification. Lee et al. [31] use an attention-based sequence-to-sequence structure that automatically focuses on certain extracted CNN features and implicitly learns a character-level language model embodied in the RNN. To handle irregular text, Shi et al. [45] and Liu et al. [37] introduce spatial attention mechanisms to transform a distorted text region into a canonical pose suitable for recognition.
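Since several of these methods (and FOTS itself) rely on CTC to turn per-time-step predictions into a character string, here is a minimal greedy (best-path) CTC decoder for illustration; the blank-at-index-0 convention and the alphabet are our assumptions.

```python
import torch

def ctc_greedy_decode(log_probs, alphabet, blank=0):
    """Best-path CTC decoding: take the argmax label at each time step,
    collapse consecutive repeats, then drop blanks.

    log_probs: (T, num_classes) tensor of per-step scores for one region.
    alphabet:  string mapping class indices 1..N to characters
               (index 0 is reserved for the CTC blank, by assumption).
    """
    best_path = log_probs.argmax(dim=-1).tolist()
    chars, prev = [], blank
    for idx in best_path:
        if idx != blank and idx != prev:   # skip blanks and repeats
            chars.append(alphabet[idx - 1])
        prev = idx
    return "".join(chars)

# Toy example: the per-step path [h, h, blank, i] collapses to "hi".
alphabet = "abcdefghijklmnopqrstuvwxyz"
steps = torch.full((4, 27), -10.0)
for t, cls in enumerate([8, 8, 0, 9]):     # 'h', 'h', blank, 'i'
    steps[t, cls] = 0.0
print(ctc_greedy_decode(steps, alphabet))  # -> hi
```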

Text Spotting

Most previous text spotting methods first generate text proposals using a text detection model and then recognize them with a separate text recognition model.

[Figure 3 image: the input image passes through Conv1-Res5 of ResNet-50, producing feature maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution, which are merged top-down through Deconv stages.]

Figure 3: Architecture of shared convolutions. Conv1-Res5 are operations from ResNet-50, and Deconv consists of one convolution to reduce feature channels and one bilinear upsampling operation.
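As an illustration of the merge step the caption describes, here is a minimal sketch of one top-down stage (a channel-reducing convolution plus 2x bilinear upsampling); the channel counts and the concatenation-based fusion are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeconvMerge(nn.Module):
    """One top-down merge stage: reduce channels with a 1x1 conv,
    bilinearly upsample 2x, then fuse with the next-finer stage."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, coarse, fine):
        x = self.reduce(coarse)                 # fewer channels
        x = F.interpolate(x, scale_factor=2.0,
                          mode="bilinear", align_corners=False)
        return torch.cat([x, fine], dim=1)      # fuse with the finer map

# Toy usage: merge a 1/32-resolution map into a 1/16-resolution one.
c5 = torch.randn(1, 2048, 10, 10)   # Res5 output (1/32)
c4 = torch.randn(1, 1024, 20, 20)   # Res4 output (1/16)
merged = DeconvMerge(2048, 128)(c5, c4)
print(merged.shape)                 # torch.Size([1, 1152, 20, 20])
```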

Jaderberg et al. [20] first generate holistic text proposals with high recall using an ensemble model, and then use a word classifier for recognition. Gupta et al. [10] train a fully convolutional regression network for text detection and adopt the word classifier from [19] for recognition. Liao et al. [34] use an SSD [36] based method for text detection and CRNN [44] for text recognition.

Recently, Li et al. [33] proposed an end-to-end text spotting method that adopts …
