Method: LCT_OCR (Institute of Information Engineering, Chinese Academy of Sciences), 2019-04-30

Authors: Yujia Li, Guangzhi Zhou, Hongchao Gao (李郁佳, 周广治, 高红超)

Description: The architecture consists of three main components: an encoder network, a multi-perspective hierarchical attention network, and a transcription layer. We first use a basic convolutional neural network to extract multi-perspective visual representations of the text image; this acts as the encoder in an encoder-decoder structure. We then design a hierarchical attention network that obtains a comprehensive text representation by fully capturing these multi-perspective visual representations. The network consists of three attention blocks, each of which integrates a local visual representation encoder module and a decoder module. Finally, we concatenate the resulting fixed-size sequences and use the result as the input of the transcription layer.
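The data flow described above can be sketched roughly as follows. This is only an illustrative NumPy outline under several assumptions, not the submitted LCT_OCR model: the random projections stand in for trained CNN and attention weights, the function names (`encoder`, `attention_block`, `transcribe`, `recognize`), the feature width, the sequence length, and the 37-class alphabet are all hypothetical, and the concatenation is assumed to run along the feature axis, which the description does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def encoder(image, num_views=3, dim=32):
    """Stand-in for the CNN encoder (assumption: random projections).

    Returns `num_views` feature sequences, each of shape (width, dim),
    playing the role of the multi-perspective visual representations.
    """
    height, _ = image.shape
    views = []
    for _ in range(num_views):
        proj = rng.standard_normal((height, dim)) / np.sqrt(height)
        views.append(np.tanh(image.T @ proj))  # one feature vector per column
    return views


def attention_block(features, seq_len=8):
    """One attention block: `seq_len` query vectors attend over one view.

    The output has a fixed size (seq_len, dim) regardless of image width.
    """
    dim = features.shape[1]
    queries = rng.standard_normal((seq_len, dim))  # hypothetical decoder queries
    weights = softmax(queries @ features.T / np.sqrt(dim), axis=-1)
    return weights @ features


def transcribe(sequence, num_classes=37):
    """Transcription layer sketch: linear map plus softmax per time step."""
    proj = rng.standard_normal((sequence.shape[1], num_classes))
    probs = softmax(sequence @ proj, axis=-1)
    return probs.argmax(axis=-1), probs


def recognize(image, seq_len=8):
    """Encoder -> three attention blocks -> concatenation -> transcription."""
    views = encoder(image, num_views=3)
    blocks = [attention_block(v, seq_len) for v in views]
    fused = np.concatenate(blocks, axis=1)  # assumed: join along feature axis
    return transcribe(fused)


labels, probs = recognize(np.ones((32, 100)))
print(labels.shape, probs.shape)
```

Because each block emits a sequence of the same fixed length, the three outputs can be joined without padding, which is presumably why the description stresses the fixed size before the transcription step.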
Other datasets: We used the training and validation datasets from other ICDAR19 competition challenges (ArT, LSVT), as well as our own synthetically generated datasets.