Method: A modified CTPN model 2.0 - Task 1 - Text Localization - ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction

Method: A modified CTPN model 2.0 (2022-05-09)

Authors: Njoyim Tchoubith Peguy Calusha

Affiliation: University of Fribourg, Switzerland

Description: The novel Connectionist Text Proposal Network (CTPN) published by Tian, Zhi, et al. develops a vertical anchor mechanism that jointly predicts the location and text/non-text score of each fixed-width proposal, improving localization accuracy. CTPN was originally created for scene text detection (ICDAR 2013 & 2015); the following enhancements have been made to adapt it to scanned receipt text localization:

- In the original CTPN architecture, there is no interaction between the localization and confidence layers. The output feature map of the localization layer has been incorporated into the computation of the confidence layer, making it focus more on meaningful regions.

- Because the positive and negative Jaccard overlap thresholds are high (0.7 and 0.5 respectively), the anchor matching strategy fails to match every ground truth box, so the average number of matched anchors is low. To fix this, the positive threshold is decreased from 0.7 to 0.5 and the negative threshold from 0.5 to 0.3.

- The regression loss used in CTPN is the smooth L1 loss. Although it is a good loss, it is still sensitive to outliers. That is why the balanced L1 loss was used instead.

- Because of the imbalance between the number of positive and negative anchors, λ1, the weight of the regression loss, is set to 4 to balance the loss terms.

- The number of channels of the RPN layer (the one that slides over the last convolutional feature maps, conv5, of the VGG16 model) is 256 instead of 512. This allows a larger image size during training and helps localize text well.

- The ratio of negative to positive anchors was changed from 1:1 to 3:1. This was found to lead to faster optimization and more stable training.
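
The lowered matching thresholds and the 3:1 sampling ratio listed above can be illustrated with a minimal, dependency-free sketch. The function names, the (x1, y1, x2, y2) box layout, the use of -1 for ignored anchors, and the exact comparison operators are assumptions made for illustration; they are not taken from the author's implementation.

```python
import random


def iou(a, b):
    """Jaccard overlap of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.3):
    """Label each anchor: 1 = positive, 0 = negative, -1 = ignored."""
    labels = []
    for anchor in anchors:
        # Best overlap of this anchor with any ground truth box.
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best >= pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)  # between the two thresholds: not used
    return labels


def sample_anchors(labels, neg_pos_ratio=3, seed=0):
    """Keep all positives and at most neg_pos_ratio negatives per positive."""
    rng = random.Random(seed)
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    n_neg = min(len(neg), neg_pos_ratio * len(pos))
    return pos, sorted(rng.sample(neg, n_neg))
```

For example, with a single 16x10 ground truth box, an anchor of the same size shifted down by 2 pixels has a Jaccard overlap of about 0.67. It would be ignored with thresholds of 0.7 and 0.5, but counts as a positive match with 0.5 and 0.3.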
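
As a rough illustration of the loss change listed above, the sketch below compares the smooth L1 loss with a balanced L1 loss. It assumes that "balanced L1 loss" refers to the formulation introduced with Libra R-CNN (Pang et al., 2019); the values alpha = 0.5 and gamma = 1.5 are the defaults of that formulation, not values stated in this description. The helper `combine_losses` is hypothetical and only shows where λ1 = 4 would enter; any other loss terms of CTPN are left out.

```python
import math


def smooth_l1(x):
    """Smooth L1 loss on a single regression residual x."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def balanced_l1(x, alpha=0.5, gamma=1.5):
    """Balanced L1 loss on a single regression residual x (assumed formulation)."""
    # b is fixed by the constraint alpha * ln(b + 1) = gamma,
    # which makes the gradient continuous at |x| = 1.
    b = math.exp(gamma / alpha) - 1.0
    ax = abs(x)
    if ax < 1.0:
        return alpha / b * (b * ax + 1.0) * math.log(b * ax + 1.0) - alpha * ax
    # Linear branch: the gradient magnitude is capped at gamma.
    return gamma * ax + gamma / b - alpha


def combine_losses(cls_loss, reg_loss, lambda1=4.0):
    """Hypothetical helper: weight the regression term by lambda1."""
    return cls_loss + lambda1 * reg_loss
```

In this formulation the gradient for small residuals is larger than with the smooth L1 loss, while the gradient for large residuals stays bounded by gamma.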

As most scanned receipts contain large areas of white space, which makes it difficult to localize text properly, the following crop preprocessing is applied:

1) Otsu's binarization (by using Sobel gradient)
2) Morphological operations (Structuring elements, MorphologyEx, Dilate, Erode)
3) Contour following
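
The idea behind the crop can be sketched in a simplified, dependency-free form. The sketch below applies Otsu's threshold directly to grayscale values and crops to the bounding box of the dark pixels; it replaces the Sobel gradient, the morphological operations, and the contour following with a plain bounding box. It also assumes dark text on a light background and an image stored as a list of rows of integers in 0..255. The function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximizes between-class variance.

    Pixels with value <= t form the dark class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of intensities in the dark class so far
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance (unnormalized)
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def crop_to_content(image):
    """Crop a grayscale image to the bounding box of its dark pixels."""
    t = otsu_threshold([p for row in image for p in row])
    rows = [y for y, row in enumerate(image) if any(p <= t for p in row)]
    cols = [x for x in range(len(image[0])) if any(row[x] <= t for row in image)]
    if not rows or not cols:
        return [row[:] for row in image]  # nothing dark found: keep as is
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1
    return [row[left:right] for row in image[top:bottom]]
```

With OpenCV, the listed steps would more likely be built from calls such as `cv2.Sobel`, `cv2.threshold` with `cv2.THRESH_OTSU`, `cv2.morphologyEx`, and `cv2.findContours`; the sketch above does not use them.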

In addition to the usual post-processing (non-maximum suppression), empty boxes are removed based on their average white pixel intensity.
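
One possible reading of the empty-box filter is sketched below: a box is dropped when the mean intensity of the pixels it covers is close to pure white. The threshold of 245, the (x1, y1, x2, y2) box layout, and the function names are assumptions for illustration; the exact criterion used by the author is not given here.

```python
def mean_intensity(image, box):
    """Mean grayscale value (0..255) of the pixels inside box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    values = [p for row in image[y1:y2] for p in row[x1:x2]]
    # A degenerate box covers no pixels; treat it as fully white (empty).
    return sum(values) / len(values) if values else 255.0


def drop_empty_boxes(image, boxes, white_thr=245.0):
    """Keep only boxes whose content is darker on average than white_thr."""
    return [b for b in boxes if mean_intensity(image, b) < white_thr]
```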

Zhi Tian, Weilin Huang, Tong He, Pan He, Yu Qiao. "Detecting Text in Natural Image with Connectionist Text Proposal Network" (2016).

Source code