method: BOE_AIoT_CTO (2020-08-10)

Authors: Guangwei Huang, Yue Li, Xiaojun Tang

Description: Our method trains PANNet and multi-oriented corner text detectors and ensembles their detection results over multi-scale images. In addition, we pre-process the training and testing images to make them clearer. Multi-scale training and training data augmentation are also used.
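The entry does not say how results from different scales (or from the two detectors) are combined. One common option, sketched here under that assumption, is to map every scale's boxes back to original-image coordinates, pool them, and apply greedy non-maximum suppression. For simplicity the sketch uses axis-aligned boxes with a score, whereas PANNet and the corner detector output oriented quadrilaterals; the input format, the function names, and the 0.5 IoU threshold are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, score


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], iou_thr: float = 0.5) -> List[Box]:
    """Greedy NMS: keep the highest-scoring box, drop later boxes that overlap a kept one."""
    kept: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < iou_thr for k in kept):
            kept.append(box)
    return kept


def ensemble_scales(dets_per_scale: Dict[float, List[Box]], iou_thr: float = 0.5) -> List[Box]:
    """Rescale each scale's detections to the original image, then fuse them with NMS.

    dets_per_scale maps a resize factor (e.g. 0.5, 1.0, 2.0) to the boxes predicted
    on the image resized by that factor (hypothetical input format).
    """
    pooled: List[Box] = []
    for scale, boxes in dets_per_scale.items():
        for x1, y1, x2, y2, score in boxes:
            pooled.append((x1 / scale, y1 / scale, x2 / scale, y2 / scale, score))
    return nms(pooled, iou_thr)
```

For example, a box found at scale 2.0 is divided by 2 before it is compared with boxes from scale 1.0, so duplicate detections of the same text line collapse to the highest-scoring one while lines seen at only one scale survive.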

method: H&H Lab (2019-04-22)

Authors: HUST_VLRGROUP (Mengde Xu, Zhen Zhu, Hui Zhang, Mingkun Yang, Jiehua Yang) & HUAWEI_CLOUD_EI (Jing Wang, Yibin Ye, Shenggao Zhu, Dandan Tu)

Description: We ensemble EAST and a multi-oriented corner detector to create a robust scene text detector. To make the network easier to train, we modified the multi-oriented corner network by adding a new branch borrowed from EAST.

method: A modified CTPN model 2.0 (2022-05-09)

Authors: Njoyim Tchoubith Peguy Calusha

Affiliation: University of Fribourg, Switzerland

Email: pegpeg07@hotmail.com

Description: The Connectionist Text Proposal Network (CTPN) published by Tian, Zhi, et al. introduced a vertical anchor mechanism that jointly predicts the location and the text/non-text score of each fixed-width proposal, improving localization accuracy. CTPN was originally designed for scene text detection (ICDAR 2013 & 2015); the following enhancements have been made to tackle scanned receipt text localization:

- In the original CTPN architecture, there is no interaction between the localization and confidence layers. The output feature map of the localization layer has been incorporated into the computation of the confidence layer, making it focus more on meaningful regions.

- Due to the high positive and negative Jaccard overlap thresholds (0.7 and 0.5 respectively), the anchor matching strategy fails to match every ground-truth box, so the average number of matched anchors is low. To fix this, the positive threshold is decreased from 0.7 to 0.5 and the negative threshold from 0.5 to 0.3.

- The regression loss used in CTPN is the smooth L1 loss. Although it works well, it is still affected by outliers, which is why the balanced L1 loss was used instead.

- Because of the imbalance between the numbers of positive and negative anchors, the weight λ1 of the regression loss is set to 4 to balance the loss terms.

- The number of channels of the RPN layer (the layer that slides over the last convolutional feature maps, conv5, of the VGG16 model) is 256 instead of 512. This makes it possible to use a larger image size during training and helps localize text well.

- The negative-to-positive sampling ratio was changed from 1:1 to 3:1, which was found to give faster optimization and more stable training.
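The anchor-related changes above can be sketched in plain Python. This is a hedged illustration, not the author's code: boxes are axis-aligned (x1, y1, x2, y2), an anchor counts as positive at IoU of at least 0.5 and negative below 0.3, negatives are sampled at up to three per positive, and regression residuals go through the balanced L1 loss from Libra R-CNN, weighted by λ1 = 4. The entry does not state the balanced L1 parameters, so the commonly used defaults α = 0.5 and γ = 1.5 are assumed; the function names, the per-anchor mean normalization, and the behaviour when no positive anchor exists are also assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # x1, y1, x2, y2

POS_IOU = 0.5     # positive threshold, lowered from CTPN's 0.7
NEG_IOU = 0.3     # negative threshold, lowered from CTPN's 0.5
NEG_PER_POS = 3   # negative:positive sampling ratio 3:1 (CTPN used 1:1)
LAMBDA1 = 4.0     # weight of the regression term


def iou(a: Rect, b: Rect) -> float:
    """Intersection over union (Jaccard overlap) of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors: Sequence[Rect], gts: Sequence[Rect]) -> List[int]:
    """1 = positive, 0 = negative, -1 = ignored (best IoU between the two thresholds)."""
    labels = []
    for a in anchors:
        best = max((iou(a, g) for g in gts), default=0.0)
        if best >= POS_IOU:
            labels.append(1)
        elif best < NEG_IOU:
            labels.append(0)
        else:
            labels.append(-1)
    return labels


def sample_anchors(labels: Sequence[int], rng: random.Random) -> Tuple[List[int], List[int]]:
    """Keep all positives and at most NEG_PER_POS negatives per positive.

    Assumption: with no positives, up to NEG_PER_POS negatives are still sampled.
    """
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    n_neg = min(len(neg), NEG_PER_POS * max(len(pos), 1))
    return pos, rng.sample(neg, n_neg)


def balanced_l1(x: float, alpha: float = 0.5, gamma: float = 1.5) -> float:
    """Balanced L1 loss (Libra R-CNN); alpha and gamma are assumed defaults, not from the entry."""
    b = math.exp(gamma / alpha) - 1.0  # chosen so that alpha * ln(b + 1) = gamma
    ax = abs(x)
    if ax < 1.0:
        return alpha / b * (b * ax + 1.0) * math.log(b * ax + 1.0) - alpha * ax
    return gamma * ax + gamma / b - alpha  # constant keeps the loss continuous at |x| = 1


def total_loss(cls_loss: float, residuals: Sequence[float]) -> float:
    """Classification loss plus LAMBDA1 times the mean balanced L1 over positive-anchor residuals."""
    if not residuals:
        return cls_loss
    reg = sum(balanced_l1(r) for r in residuals) / len(residuals)
    return cls_loss + LAMBDA1 * reg
```

Lowering both thresholds turns anchors that only partially cover a short receipt line into positives, while the 3:1 sampler keeps the background anchors from overwhelming each mini-batch.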

As most scanned receipts contain dominant white space, which makes it difficult to localize text properly, the following cropping preprocessing has been applied:

1) Otsu's binarization (using the Sobel gradient)
2) Morphological operations (Structuring elements, MorphologyEx, Dilate, Erode)
3) Contour following
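The three cropping steps could look roughly like the pure-Python sketch below, which works on a small grayscale image stored as nested lists. It is an assumption-laden illustration rather than the entry's implementation: it binarizes the Sobel gradient magnitude with Otsu's threshold, dilates the mask with a 3x3 square structuring element (erosion is omitted), and, in place of true contour following, returns the bounding box of all foreground pixels, i.e. the box enclosing every contour. In OpenCV the same steps would typically use cv2.Sobel, cv2.threshold with cv2.THRESH_OTSU, cv2.getStructuringElement with cv2.morphologyEx (or cv2.dilate/cv2.erode), and cv2.findContours.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale, values 0..255, indexed [row][col]


def sobel_magnitude(img: Image) -> Image:
    """|Gx| + |Gy| with 3x3 Sobel kernels, clipped to 255; border pixels stay 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = min(255, abs(gx) + abs(gy))
    return out


def otsu_threshold(img: Image) -> int:
    """Threshold t maximising between-class variance; pixels > t are foreground."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * c for i, c in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def dilate(mask: List[List[bool]], r: int = 1) -> List[List[bool]]:
    """Binary dilation with a (2r+1)x(2r+1) square structuring element."""
    h, w = len(mask), len(mask[0])
    return [[any(mask[yy][xx]
                 for yy in range(max(0, y - r), min(h, y + r + 1))
                 for xx in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)] for y in range(h)]


def crop_box(img: Image, r: int = 1) -> Optional[Tuple[int, int, int, int]]:
    """(x1, y1, x2, y2) of the text region, x2/y2 exclusive, or None for a blank image."""
    grad = sobel_magnitude(img)
    t = otsu_threshold(grad)
    mask = dilate([[v > t for v in row] for row in grad], r)
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    xs = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return xs[0], ys[0], xs[-1] + 1, ys[-1] + 1
```

The receipt image would then be cropped to the returned box before detection, so the network sees less empty paper.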

In addition to the usual post-processing (non-maximum suppression), empty boxes are removed based on their average white-pixel intensity.
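One reading of "average white pixel intensity", assumed here, is that a box whose pixels are almost all background white contains no ink and can be discarded. The sketch computes each box's mean grayscale intensity and keeps only boxes below a hypothetical cutoff of 245; the cutoff, the exclusive box convention, and the function names are illustrative, not taken from the entry.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]          # grayscale, 0 = black ink, 255 = white paper
Box = Tuple[int, int, int, int]  # x1, y1, x2, y2 with exclusive x2, y2 (assumed)

MAX_MEAN = 245.0  # hypothetical cutoff: boxes at least this bright count as empty


def mean_intensity(img: Image, box: Box) -> float:
    """Average grayscale value inside the box; a zero-area box is treated as pure white."""
    x1, y1, x2, y2 = box
    vals = [img[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return sum(vals) / len(vals) if vals else 255.0


def remove_empty_boxes(img: Image, boxes: Sequence[Box], max_mean: float = MAX_MEAN) -> List[Box]:
    """Drop boxes that are (almost) entirely white, i.e. that contain no text."""
    return [b for b in boxes if mean_intensity(img, b) < max_mean]
```

This filter would run after NMS, so it only has to inspect the boxes that survive suppression.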

Ranking Table

Date        Method                                            Recall   Precision  Hmean
2020-08-10  BOE_AIoT_CTO                                      98.76%   98.92%     98.84%
2019-04-22  H&H Lab                                           97.93%   97.95%     97.94%
2022-05-09  A modified CTPN model 2.0                         97.52%   97.40%     97.46%
2021-10-22  A modified CTPN model 1.0                         97.16%   97.10%     97.13%
2019-04-22  GREAT-OCR Shanghai University                     96.62%   96.21%     96.42%
2019-04-21  IFLYTEK-textDet_v3                                93.77%   95.89%     94.81%
2019-04-22  A Single-Shot Model for Robust Text Localization  93.93%   94.80%     94.37%
2019-04-19  BiLSTM Based on CTPN                              91.40%   94.03%     92.69%
2019-04-17  EAST_clip_enhance_896_giou                        89.69%   93.77%     91.68%
2019-04-17  Textline detection                                89.85%   92.72%     91.26%
2019-04-20  A Text Localization Method Based on CTPN          85.23%   88.73%     86.94%
2019-04-16  Vsdnu                                             85.07%   87.17%     86.11%
2019-04-17  scene text detection weapon                       49.61%   64.75%     56.18%
2019-04-17scene text detection weapon49.61%64.75%56.18%
