method: LCT_OCR (Institute of Information Engineering, Chinese Academy of Sciences) 2019-04-30

Authors: Yujia Li, Guangzhi Zhou, Hongchao Gao (李郁佳, 周广治, 高红超)

Description: The architecture consists of three main components: an encoder network, a multi-perspective hierarchical attention network, and a transcription layer. We first use a base convolutional neural network to extract multi-perspective visual representations of text images; this serves as the encoder in the encoder-decoder structure. We then design a hierarchical attention network that obtains comprehensive text representations by fully capturing the multi-perspective visual representations. This network consists of three attention blocks; in each block, a local visual representation encoder module and a decoder module are paired as an ensemble. Finally, we concatenate the resulting fixed-size sequences, which form the input to the transcription layer.
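The data flow described above can be sketched as follows. This is a minimal NumPy illustration under assumptions not stated in the description: each perspective yields a variable-length feature sequence, each attention block pairs a linear encoder projection with a decoder that uses a fixed set of learned queries (so every block emits a fixed-size output), and the block outputs are concatenated along the feature dimension before the transcription layer. All dimensions, the query mechanism, and the random weights are hypothetical placeholders, not the authors' actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores) @ values

class AttentionBlock:
    """One block: a local visual-representation encoder paired with a decoder."""
    def __init__(self, d_in, d_model, out_len):
        # hypothetical encoder projection and learned decoder queries
        self.W_enc = rng.normal(scale=0.1, size=(d_in, d_model))
        self.queries = rng.normal(scale=0.1, size=(out_len, d_model))

    def __call__(self, feats):
        encoded = feats @ self.W_enc                       # local encoder module
        return attention(self.queries, encoded, encoded)   # decoder: fixed-length output

# three perspectives from the CNN encoder, with different sequence lengths
perspectives = [rng.normal(size=(t, 64)) for t in (20, 30, 25)]
blocks = [AttentionBlock(d_in=64, d_model=32, out_len=10) for _ in perspectives]

# each block yields a fixed-size (10, 32) representation; their concatenation
# is the fixed-size sequence fed to the transcription layer
fused = np.concatenate([b(f) for b, f in zip(blocks, perspectives)], axis=-1)
print(fused.shape)  # (10, 96)
```

The fixed query count is what guarantees a fixed-size output regardless of the input image width, which is the property the transcription layer needs.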
Other datasets: we used the training and validation datasets from the other challenges of the ICDAR19 competition (ArT, LSVT), as well as our own synthetic datasets.