Authors: Jianzhong Xu, Hailong Wang, Long Ma
Description: Our method is based on the CRNN framework. We use SE-ResNet with multi-scale features as the backbone, and the extracted features are fused by a two-layer transformer unit. We also introduce squeeze-and-excitation and relative position encodings into the transformer. Our training data consists of 20 million samples, including ReCTS and ArT. Models with the same architecture have been deployed online for Sogou Input Text Scanning.
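
The description does not say where squeeze-and-excitation is applied inside the transformer, so the following is only a minimal sketch of the generic squeeze-and-excitation gate applied to a feature sequence, not our actual implementation. The function name `squeeze_excite`, the weight matrices `w1` and `w2`, and the omission of bias terms are all assumptions made for illustration.

```python
import math


def squeeze_excite(seq, w1, w2):
    """Generic squeeze-and-excitation gate over a feature sequence (illustrative only).

    seq: list of T feature vectors, each of length C.
    w1:  R x C weight matrix (hypothetical bottleneck layer, no bias).
    w2:  C x R weight matrix (hypothetical expansion layer, no bias).
    """
    t = len(seq)
    c = len(seq[0])
    # Squeeze: average each channel over all time steps.
    squeezed = [sum(step[j] for step in seq) / t for j in range(c)]
    # Excitation, step 1: bottleneck projection followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: expansion followed by a sigmoid, one gate per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: reweight every time step channel-wise with the gates.
    return [[x * g for x, g in zip(step, gates)] for step in seq]
```

Each gate lies strictly between 0 and 1, so the block can only attenuate channels relative to one another; the sequence length and channel count are unchanged.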