- Task 2 - End-to-end on Test Set - Method: DBNet++ and SATRN
Method: DBNet++ and SATRN
Date: 2023-03-30
Authors: Yudong Chen, Yuke Zhu, Yue Zhang, Liang Hu, Sheng Guo
Affiliation: MYBank
Description: We adopt a two-stage pipeline, text detection followed by text recognition, to solve Task 2.
For text detection, we use the DBNet++ model with a ResNet50-dcnv2 backbone and an FPNC neck. The detection training data consists solely of the 8,281 images in the HierText training set; we train directly on the word-level annotations and do not use any pretrained model. The optimizer is Adam with an initial learning rate of 3e-4. We train for 1200 epochs on 8 NVIDIA A100 GPUs with a batch size of 16 per GPU.
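As a rough sanity check on the detection schedule above, the following sketch derives the global batch size and an approximate iteration count from the stated numbers. It assumes plain data-parallel training and that the last partial batch of each epoch is kept; neither detail is stated in the description, and the variable names are illustrative only.

```python
import math

# Figures taken from the description above.
NUM_GPUS = 8
BATCH_PER_GPU = 16
NUM_TRAIN_IMAGES = 8281
NUM_EPOCHS = 1200

# Global batch size under plain data parallelism (assumption).
global_batch = NUM_GPUS * BATCH_PER_GPU

# Assumption: the last, partial batch of each epoch is kept (ceil).
iters_per_epoch = math.ceil(NUM_TRAIN_IMAGES / global_batch)
total_iters = iters_per_epoch * NUM_EPOCHS

print(global_batch, iters_per_epoch, total_iters)
```

Under these assumptions the schedule amounts to roughly 78k optimizer steps at a global batch size of 128.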
For text recognition, we use the SATRN-Base model with a ShallowCNN backbone, a 12-layer SATRNEncoder, and a 6-layer NRTRDecoder. We count the occurrences of every character in the HierText training set, filter out the uncommon characters, and build a charset of 285 characters. We then crop the text regions from the HierText training set and discard the samples whose text contains characters outside the charset, which yields 859k text samples. We additionally generate 5 million synthetic samples with Text_Renderer, merge them with the cropped data, and set the sampling ratio to 5:1 during training. The optimizer is Adam with an initial learning rate of 3e-4. We train for 10 epochs on 8 NVIDIA A100 GPUs with a batch size of 12 per GPU.
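The charset construction and sample filtering described above can be sketched as follows. This is a minimal illustration, not the authors' code: the frequency cutoff `min_count` is a hypothetical parameter (the description does not say how "uncommon" was defined), and the function names and toy labels are made up for the example.

```python
from collections import Counter


def build_charset(labels, min_count):
    """Keep characters that occur at least `min_count` times (hypothetical cutoff)."""
    counts = Counter(ch for text in labels for ch in text)
    return {ch for ch, n in counts.items() if n >= min_count}


def filter_labels(labels, charset):
    """Drop empty labels and labels containing any out-of-charset character."""
    return [t for t in labels if t and all(ch in charset for ch in t)]


# Toy data: "é" occurs only once, so it is treated as uncommon.
labels = ["cat", "act", "tac", "caté"]
charset = build_charset(labels, min_count=2)
kept = filter_labels(labels, charset)

print(sorted(charset), kept)
```

Here the rare character is removed from the charset first, and the one label that contains it is then excluded from the recognition training data.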
For inference, the DBNet++ pre-processing first resizes the image so that the short side is not less than 1280 and the long side is not greater than 2560. DBNet++ represents the detected text regions as quadrilaterals. During the post-processing of the detection model, a perspective transformation is applied to each detected quadrilateral to map it to a horizontal rectangular image. We then resize each cropped image to a height of 32, scale the width proportionally, and send it to the SATRN recognition model. The detection and recognition thresholds are both set to 0.95.
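The two resizing rules above can be sketched as plain size computations. This is one possible reading, not the authors' implementation: it assumes the image is only upscaled when its short side is below 1280, and that the 2560 cap on the long side takes priority when both constraints cannot hold (for very elongated images). The function names are hypothetical.

```python
def detection_size(width, height, min_short=1280, max_long=2560):
    """Target size for the detector input, keeping the aspect ratio."""
    short, long_ = min(width, height), max(width, height)
    # Upscale only if the short side is below the minimum.
    scale = max(1.0, min_short / short)
    # Assumption: the long-side cap wins if the two constraints conflict.
    if long_ * scale > max_long:
        scale = max_long / long_
    return round(width * scale), round(height * scale)


def recognition_size(width, height, target_height=32):
    """Target size for a rectified crop: fixed height, proportional width."""
    new_width = max(1, round(width * target_height / height))
    return new_width, target_height


print(detection_size(640, 480))    # small image, upscaled
print(detection_size(4000, 3000))  # large image, capped on the long side
print(recognition_size(200, 64))   # crop rescaled to height 32
```

The rectification step itself (mapping a quadrilateral to a horizontal rectangle) is not shown; only the resulting crop size is rescaled here.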