method: CNN-LSTM based text recognition (2017-06-30)

Authors: Lulu Xu

Description: Detection combines a convolutional network and a recurrent network. The feature maps are generated by a fully convolutional network (FCN) based on the VGG16 model, consisting of 14 convolutional layers divided into 5 stages. The FCN detects a text line by sliding a fixed-size window over its last convolutional feature map and densely outputs a sequence of fixed-width text proposals. The sequential text line proposals are then connected by a recurrent neural network (RNN), for which we use the long short-term memory (LSTM) architecture. Image post-processing removes false-positive proposals and groups the remaining components to produce the final detections. For text line recognition, we implement an optimized convolutional recurrent neural network (CRNN), first proposed by Baoguang Shi [1], which can handle sequences of arbitrary length. In the CRNN model, the feature extraction part is an FCN based on a simplified Inception-ResNet network built from 20 convolutional layers and 3 max-pooling layers. The FCN is followed by an RNN built from 4 LSTMs, two forward and two backward. The RNN predicts a label distribution for each column of the last FCN feature map, and the predicted distributions are fed into a Connectionist Temporal Classification (CTC) layer.
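
The last step of the recognition branch, turning per-column distributions into a text string through CTC, can be illustrated with a minimal sketch. The description does not say which decoding rule is used, so the best-path (greedy) rule shown here is an assumption, as are the function name `ctc_greedy_decode`, the blank index 0, and the toy alphabet.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label (not stated in the description)


def ctc_greedy_decode(column_probs: Sequence[Sequence[float]],
                      alphabet: str,
                      blank: int = BLANK) -> str:
    """Best-path CTC decoding: argmax per column, merge repeats, drop blanks.

    column_probs holds one distribution per feature-map column;
    label k >= 1 is assumed to map to alphabet[k - 1].
    """
    decoded: List[str] = []
    previous = blank
    for probs in column_probs:
        # Most likely label for this column.
        label = max(range(len(probs)), key=lambda k: probs[k])
        # Emit a character only when the label is not blank and differs
        # from the label of the previous column.
        if label != blank and label != previous:
            decoded.append(alphabet[label - 1])
        previous = label
    return "".join(decoded)


def one_hot(label: int, size: int) -> List[float]:
    """Hypothetical helper: a distribution with all mass on one label."""
    return [1.0 if k == label else 0.0 for k in range(size)]


# Toy input: 8 columns over the labels {blank, 'a', 'b', 'c'}.
columns = [one_hot(label, 4) for label in [1, 1, 0, 1, 2, 2, 0, 3]]
print(ctc_greedy_decode(columns, "abc"))  # prints "aabc"
```

The blank between the second and third `a` columns is what lets the decoder keep a doubled character, while the two adjacent `a` columns at the start collapse into one.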