method: AAIG-OCR-NLP (2023-04-14)

Authors: Liu Yang, Yang Fan, Lin Junyu, Tang Bin, Jin Xuan, Yuan Bo, He Yuan, Huang Longtao

Affiliation: Alibaba Artificial Intelligence Governance Research Center (AAIG)

Description: We used a regression-based text detector, a ViT-based text recognizer, and a transformer-based NLP semantic correction module to complete the end-to-end text spotting task. We first pre-trained the models on a training set that includes LSVT, RCTW, MLT, ArT, and other datasets. We then fine-tuned them on the ReCTS training set to obtain the final models. The final results were produced with single-scale testing and no ensembling.

method: NSTD-MCEM-iFLYTEK (2019-10-12)

Authors: iFLYTEK

Affiliation: iFLYTEK

Description: The natural scene text detector (NSTD-iFLYTEK) is based on Mask R-CNN with ResNet-101. Only ICDAR 2019 datasets are used for training: ReCTS, LSVT, MLT, and ArT. Detection uses multi-scale training and single-scale testing, with no model ensemble. The recognition ensemble is built from attention-based text recognizers. The final results fuse different channel information from different models.
Xiangxiang Wang (王翔翔), iFLYTEK (科大讯飞)
Jian Dong (董健), iFLYTEK (科大讯飞)
Fengren Wang (王烽人), iFLYTEK (科大讯飞)
Jiajia Wu (吴嘉嘉), iFLYTEK (科大讯飞)
Yin Lin (林垠), iFLYTEK (科大讯飞)
Lou Shun (娄舜), iFLYTEK (科大讯飞)
Jinshui Hu (胡金水), iFLYTEK (科大讯飞)

method: Tencent-DPPR Team (2019-04-29)

Authors: Shangxuan Tian, Haoxi Li, Sicong Liu, Longhuang Wu, Chunchao Guo, Haibo Qin, Chang Liu, Hongfa Wang, Hongkai Chen, Qinglin Lu, Xucheng Yin, Lei Xiao

Description: We are the Tencent-DPPR (Data Platform Precision Recommendation) team. In the detection stage, we pretrain our model on the LSVT dataset and then train the text detector on the provided ReCTS dataset, using a multi-scale training policy.
Our text detector is a two-stage method. The backbone is ResNet101, used as the feature extractor. In the FPN, we designed a policy that lets each proposal select which feature pyramid layers to extract features from, instead of assigning a single layer according to box size.
For the detection ensemble, we apply multi-scale testing with different backbones. To combine all the results, we score each box and then vote on the boxes.
In the recognition stage, we use a synthetic dataset of more than fifty million images together with open-source datasets: LSVT, ReCTS, COCO-Text, RCTW, and ICPR-2018-MTWI. Data augmentation includes Gaussian blur, Gaussian noise, and other techniques. All samples are resized to the same height before being fed into the network.
We use five types of deep models in the recognition stage, including CTC-based networks and multi-head-attention-based networks. For task 1, we select the character predicted most frequently across all results. For task 2 and task 4, we also use the predicted confidence scores of the cropped words and the ensemble results to select the most reliable result among those predicted by all models.
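The task 1 rule described above (keep the character predicted most often across the models) amounts to a majority vote. The following is a minimal sketch of that idea, not the team's actual code. The function names `vote_character` and `select_word`, the tie-breaking rule, and the confidence-summing scheme are all illustrative assumptions; the description does not say how ties are resolved or exactly how confidence scores are combined for task 2 and task 4.

```python
from collections import Counter


def vote_character(predictions):
    """Return the character predicted most often across the models.

    Assumption: on a tie, the prediction listed first wins. The source
    description does not specify a tie-breaking rule.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    # Scan in the original order so the earliest prediction wins a tie.
    for char in predictions:
        if counts[char] == best:
            return char


def select_word(candidates):
    """Pick one transcription from (text, confidence) pairs.

    Assumption: confidences of identical transcriptions are summed and
    the highest total wins. This is only one possible reading of
    "use the predicted confidence scores ... to select the reliable one".
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    totals = {}
    for text, confidence in candidates:
        totals[text] = totals.get(text, 0.0) + confidence
    return max(totals, key=totals.get)
```

For example, `vote_character(["国", "图", "国"])` returns `"国"`, and `select_word([("hello", 0.6), ("hallo", 0.9), ("hello", 0.5)])` returns `"hello"` because two moderately confident votes outweigh one confident outlier under this summing assumption.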

Ranking Table

Date        Method                 Recall  Precision  Hmean   1-NED
2023-04-14  AAIG-OCR-NLP           93.11%  93.98%     93.54%  83.60%
2019-10-12  NSTD-MCEM-iFLYTEK      93.16%  93.63%     93.40%  81.96%
2019-04-29  Tencent-DPPR Team      92.49%  93.49%     92.99%  81.45%
2019-04-30  SANHL_v1               93.86%  91.98%     92.91%  81.43%
2019-04-26  baseline_0.7           93.62%  87.22%     90.30%  76.60%
2019-04-30  Task4-re3              90.80%  90.26%     90.53%  73.43%
2019-04-30  pursuer                86.12%  92.73%     89.30%  72.76%
2019-04-30  HUST_e2e               91.54%  90.28%     90.91%  71.89%
2019-04-29  CLTDR                  88.89%  88.92%     88.91%  71.81%
2019-04-30  MCEM v3                84.64%  89.56%     87.03%  71.10%
2021-09-20  ABCNetv2               87.91%  92.89%     90.33%  63.94%
2019-04-30  CRAFT + TPS-ResNet v3  75.89%  78.44%     77.14%  41.68%
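
The Hmean column is the harmonic mean of recall and precision, and 1-NED is commonly defined as one minus the average normalized edit distance between predicted and ground-truth transcriptions. The sketch below shows both computations under stated assumptions: the function names are illustrative, the edit distance is plain Levenshtein distance normalized by the longer string's length, and the pairing of predictions with ground-truth text (which in an end-to-end task depends on detection matching) is taken as given. It is not the official evaluation script.

```python
def hmean(recall, precision):
    """Harmonic mean of recall and precision (same units in and out)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def one_minus_ned(pairs):
    """1 - mean normalized edit distance over (prediction, truth) pairs.

    Assumption: each distance is divided by the longer string's length,
    and a pair of two empty strings contributes a distance of zero.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    total = 0.0
    for prediction, truth in pairs:
        longest = max(len(prediction), len(truth))
        if longest:
            total += edit_distance(prediction, truth) / longest
    return 1.0 - total / len(pairs)
```

Applied to the top entry, `hmean(93.11, 93.98)` rounds to 93.54, matching the table. Recomputing other rows from the rounded percentages can differ in the last digit, since the published values were presumably computed from unrounded counts.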

Ranking Graphic
