Authors: Honbin Sun, Xiaomeng Song, Xiaoyu Yue, Youjiang Xu, Zhanghui Kuang (SenseTime Group)
Description: We propose an end-to-end model for multi-oriented scene text recognition. The model is composed of a 31-layer ResNet, a GRU-based encoder-decoder framework, and a 2-dimensional attention module. Specifically, the ResNet extracts 2D CNN feature maps from input images; a 2-layer GRU encoder receives one column or one row of the feature maps at a time, after max-pooling along the vertical or horizontal axis; and another 2-layer GRU decoder performs character classification at each step. More importantly, we use a tailored 2D attention mechanism that focuses on text-relevant regions of the feature maps, conditioned on the current hidden state, which carries semantic information from previously decoded results at each decoding step.
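The two key operations described above can be sketched in NumPy: max-pooling the 2D feature map along one axis to produce the sequence fed to the encoder GRU, and an additive 2D attention that scores every spatial position against the decoder hidden state to pool a glimpse vector. This is a minimal illustration, not the authors' implementation; the function names and the projection weights `W_f`, `W_h`, `w` are hypothetical, and the score function is assumed to be Bahdanau-style additive attention.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a flat vector
    e = np.exp(x - x.max())
    return e / e.sum()

def encode_columns(feats):
    """Max-pool the feature map along the vertical axis, yielding one
    vector per column; this (W, C) sequence would feed the 2-layer
    encoder GRU. feats: (H, W, C)."""
    return feats.max(axis=0)

def attend_2d(feats, hidden, W_f, W_h, w):
    """2D attention sketch: score each spatial location of the feature
    map against the decoder hidden state, then pool a glimpse.
    feats: (H, W, C); hidden: (D,) decoder state.
    W_f (C, A), W_h (D, A), w (A,) are hypothetical projections."""
    H, Wd, C = feats.shape
    # additive score at every spatial position -> (H, W)
    scores = np.tanh(feats @ W_f + hidden @ W_h) @ w
    # normalize over all H*W positions to get the attention map
    alpha = softmax(scores.ravel()).reshape(H, Wd)
    # attention-weighted sum of feature vectors -> glimpse (C,)
    glimpse = (alpha[..., None] * feats).sum(axis=(0, 1))
    return glimpse, alpha
```

At each decoding step the glimpse would be combined with the decoder GRU state to classify the next character, and the attention map `alpha` shows which image regions the model attended to.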