Authors: CKD Team (Xiaocong Cai, Wenyang Hu, Jun Hou, Miaomiao Cheng)
1) The method is designed based on the rectify-encoder-decoder framework.
2) Our training data contains about 5,600,000 images from Synth90k, SynthText, SynthAdd, and some academic datasets.
3) Variable-length input is adopted here, with a maximum input size of 64x160. Images are first rectified by an STN (Spatial Transformer Network). The rectified images are then passed to CNN backbones (e.g., ResNet) to extract features. For the decoder, we train different models with three kinds of decoders: CTC, 1D attention, and 2D attention. Finally, the prediction results of these models are ensembled together.
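The rectify-encoder-decoder pipeline above can be sketched as follows. This is a minimal illustrative PyTorch sketch, not the authors' actual architecture: the layer sizes, the toy affine STN, and the single CTC head are all assumptions (the real system uses ResNet-style backbones and three decoder types).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RectifyEncoderDecoder(nn.Module):
    """Toy rectify-encoder-decoder recognizer (illustrative only)."""

    def __init__(self, num_classes=37, hidden=256):
        super().__init__()
        # Localization net predicts a 2x3 affine matrix for the STN rectifier.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 6),
        )
        # Initialize to the identity transform so training starts stable.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0, 0, 0, 1, 0]))
        # Small CNN encoder standing in for a ResNet backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, hidden, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height -> 1D sequence
        )
        self.ctc_head = nn.Linear(hidden, num_classes + 1)  # +1 for CTC blank

    def forward(self, x):
        # Rectify: sample the input through a predicted affine grid.
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        rectified = F.grid_sample(x, grid, align_corners=False)
        # Encode: (B, C, 1, W) -> sequence of per-column features (B, W, C).
        feat = self.encoder(rectified).squeeze(2).permute(0, 2, 1)
        # Decode: per-timestep class log-probabilities for CTC.
        return self.ctc_head(feat).log_softmax(-1)

model = RectifyEncoderDecoder()
logp = model(torch.randn(2, 1, 64, 160))  # batch of 64x160 grayscale crops
```

In practice the 1D- and 2D-attention decoders would consume the same encoder features, and the per-model character probabilities would be combined (e.g., averaged) for the ensemble.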
4) In addition, several data augmentation methods and other training tricks are used in this work.
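As a rough illustration of such augmentation, the sketch below applies brightness jitter, additive noise, and horizontal stretching to a word image. These specific transforms and their parameter ranges are assumptions chosen as typical for text recognition, not the exact tricks used in this work.

```python
import numpy as np

def augment(img, rng):
    """Apply simple augmentations to a grayscale word image (H, W), uint8."""
    out = img.astype(np.float32)
    # Brightness jitter: scale intensities by a random factor.
    out *= rng.uniform(0.8, 1.2)
    # Additive Gaussian noise.
    out += rng.normal(0.0, 5.0, size=out.shape)
    # Random horizontal stretch/squeeze via nearest-neighbor resampling,
    # simulating the aspect-ratio variation of scene-text crops.
    w = out.shape[1]
    new_w = int(w * rng.uniform(0.9, 1.1))
    cols = np.clip((np.arange(new_w) * w / new_w).astype(int), 0, w - 1)
    out = out[:, cols]
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 160), dtype=np.uint8)
aug = augment(img, rng)  # same height, slightly different width
```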