method: 3CNN_2BiLSTM_CTC (2017-06-30)

Authors: Ma Long

Description: The model we use for line recognition is based on a convolutional recurrent neural network. For a test line, a fully convolutional network (FCN) extracts features, which are then fed into LSTMs. The FCN consists of 18 convolution layers and 3 max-pool layers. For a test line of size 56*320, the last FCN feature map has a size of 42*(56/8)*(320/8). The channels are then concatenated, resulting in a 294*40 feature map. A 2-layer bidirectional LSTM follows the FCN, and the predicted distributions are fed into a Connectionist Temporal Classification (CTC) layer [1]. The proposed model is similar to the CRNN proposed by [2]. The difference is that our model can handle sequences of arbitrary length: the width of the input image is rescaled according to its aspect ratio. The model is trained on 1.5 million human-labeled lines. It takes about 20 epochs to reach a good model using a single NVIDIA(R) Tesla(TM) M40 GPU.
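The channel-concatenation step above can be sketched as a simple reshape: the 42-channel, 7-row, 40-column feature map is flattened along the channel and height axes to give 40 time steps of 294 features each for the BiLSTM. This is a minimal NumPy illustration of that tensor manipulation, not the authors' actual code; the function name and the use of NumPy are assumptions.

```python
import numpy as np

def flatten_feature_map(fmap):
    """Hypothetical sketch: concatenate the channel and height axes of an
    FCN feature map of shape (channels, height, width) into a single
    feature axis, yielding a (channels*height, width) sequence where each
    of the `width` columns is one time step for the BiLSTM."""
    c, h, w = fmap.shape
    return fmap.reshape(c * h, w)

# For a 56x320 input line downsampled by 8 in each spatial dimension,
# the last feature map has shape (42, 56 // 8, 320 // 8) = (42, 7, 40).
fmap = np.zeros((42, 56 // 8, 320 // 8))
seq = flatten_feature_map(fmap)
print(seq.shape)  # (294, 40)
```

Each of the 40 columns then corresponds to a horizontal position in the input line, which is what allows the CTC layer to align predictions with the unsegmented transcription.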
[1] A. Graves, S. Fernandez, F. J. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006.
[2] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. CoRR, 2015.