Authors: Mingkun Yang, Hui Zhang, Zhen Zhu, Mengde Xu, Jiehua Yang, Jing Wang, Yibin Ye, Shenggao Zhu, Dandan Tu
Description: We are from Huazhong University of Science and Technology. We adopt a two-stage method for this end-to-end (e2e) task: text detection followed by text recognition. Our detection method is modified from Mask TextSpotter, based on the ResNet-50-FPN backbone; we use only its detection part and omit the text recognition part. We first conduct an aspect ratio clustering on the training set and set the anchor scales for the region proposal network to (0.1, 0.18, 0.25, 0.5, 1.0, 2.0). To obtain high-quality proposals, we incorporate Cascade R-CNN into the network and set the positive IoUs to (0.7, 0.5, 0.6, 0.7) and the negative IoUs to (0.3, 0.5, 0.6, 0.7). We also change the convolutions in the last two stages to modulated deformable convolutions to enhance the model's ability to capture the large or long text instances that appear widely in the dataset. The detection network is trained with the minimum side of the input image set to 1600. For better performance, we conduct multi-scale testing at scales (1000, 1200, 1400, 1600, 1800). The final results from the multiple scales are obtained by filtering out boxes whose scores are below a threshold of 0.7 and then applying standard non-maximum suppression with the overlap set to 0.1. For recognition, we mainly use CRNN equipped with multiple advanced backbones and some improvements to obtain the final results. To handle irregular text instances, we add a rectification module before recognition.
 Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, Xiang Bai. Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes. ECCV 2018.
 Zhaowei Cai, Nuno Vasconcelos. Cascade R-CNN: Delving Into High Quality Object Detection. CVPR 2018.
 Xizhou Zhu, Han Hu, Stephen Lin, Jifeng Dai. Deformable ConvNets v2: More Deformable, Better Results. CoRR abs/1811.11168 (2018).
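
The multi-scale fusion step in the description (drop boxes scoring below 0.7, then apply non-maximum suppression with overlap 0.1) can be illustrated with a minimal sketch. This is not the submission's code: the function names `iou` and `fuse_multiscale` are hypothetical, detections are assumed to be axis-aligned boxes `(x1, y1, x2, y2, score)` already mapped back to the original image coordinates, and a plain greedy NMS stands in for whatever implementation was actually used.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_multiscale(detections_per_scale, score_thr=0.7, nms_overlap=0.1):
    """Pool detections from all test scales, filter by score, then run greedy NMS.

    detections_per_scale: one list per test scale, each holding
    (x1, y1, x2, y2, score) tuples in original-image coordinates (assumption).
    """
    # Step 1: pool all scales and drop boxes whose score is below the threshold.
    pooled = [
        det
        for dets in detections_per_scale
        for det in dets
        if det[4] >= score_thr
    ]
    # Step 2: greedy NMS, visiting boxes from highest to lowest score.
    pooled.sort(key=lambda det: det[4], reverse=True)
    kept = []
    for det in pooled:
        # Keep a box only if it overlaps every already-kept box by at most nms_overlap.
        if all(iou(det, k) <= nms_overlap for k in kept):
            kept.append(det)
    return kept


if __name__ == "__main__":
    # Toy detections from two hypothetical test scales.
    scale_a = [(10, 10, 110, 40, 0.95), (200, 50, 300, 80, 0.60)]
    scale_b = [(12, 11, 112, 41, 0.90), (400, 100, 520, 130, 0.80)]
    for box in fuse_multiscale([scale_a, scale_b]):
        print(box)
```

With an overlap threshold as low as 0.1, almost any two boxes that touch the same text instance collapse into the higher-scoring one, which suits pooling near-duplicate detections across scales. For rotated or polygonal text regions, the `iou` helper would need to be replaced by a polygon-based overlap.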