Method: DeepSE End-to-End Text Detection and Recognition Model
Date: 2023-04-02

Authors: Wenzhe Hu*, Hongtao Wen*, Siyuan Zhou, Qingwen Bu, Yichuan Cheng, Minbin Huang

Affiliation: DeepSE x Upstage HK

Description: Note: '*' in the author list denotes equal contribution.

For Detection:
We train DBNet as the scene text detector, which predicts a set of detection boxes.
Concretely, we use an oCLIP-pretrained Swin Transformer-Base model as the backbone and predict directly at three feature levels. Following DBNet, we employ balanced cross-entropy loss for the binary map and L1 loss for the threshold map. We further fine-tune the model with the Lovász loss for finer localization; a loss sketch follows.
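
To make the loss design concrete, here is a minimal PyTorch sketch of how the three terms could be combined. It follows the DBNet formulation (balanced cross-entropy with 3:1 hard-negative mining, L1 on the threshold map); the weights alpha/beta, the tensor layout, and the use of Dice as a stand-in for the Lovász fine-tuning loss are assumptions for illustration, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def db_loss(prob_map, thresh_map, binary_map,
            gt_shrink, gt_shrink_mask, gt_thresh, gt_thresh_mask,
            alpha=1.0, beta=10.0, neg_ratio=3.0):
    """Sketch of a DBNet-style loss: balanced BCE, L1 on the threshold
    map, and Dice on the approximate binary map (the fine-tuning stage
    would swap in the Lovász loss here). All maps share one shape."""
    # Balanced cross-entropy: keep all positives plus the hardest
    # negatives at a neg_ratio:1 ratio.
    bce = F.binary_cross_entropy(prob_map, gt_shrink, reduction="none")
    pos = (gt_shrink > 0.5) & (gt_shrink_mask > 0.5)
    neg = (gt_shrink <= 0.5) & (gt_shrink_mask > 0.5)
    n_pos = int(pos.sum())
    n_neg = min(int(neg.sum()), int(n_pos * neg_ratio))
    neg_losses, _ = bce[neg].topk(n_neg)
    l_prob = (bce[pos].sum() + neg_losses.sum()) / (n_pos + n_neg + 1e-6)

    # L1 loss on the threshold map, computed only inside the border region.
    l_thresh = (torch.abs(thresh_map - gt_thresh) * gt_thresh_mask).sum() \
               / (gt_thresh_mask.sum() + 1e-6)

    # Dice loss on the approximate binary map (placeholder for Lovász,
    # used here only to keep the sketch short).
    inter = (binary_map * gt_shrink * gt_shrink_mask).sum()
    union = (binary_map * gt_shrink_mask).sum() \
            + (gt_shrink * gt_shrink_mask).sum()
    l_binary = 1.0 - 2.0 * inter / (union + 1e-6)

    return l_prob + alpha * l_binary + beta * l_thresh
```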

For Recognition:
Training stage
1. We crop the annotated text instances from the training data to build the training set for our scene text recognition model (see the cropping sketch after this list);
2. We train PARSeq as the recognition model on these crops.
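
A minimal sketch of the cropping step, assuming quadrilateral (4-point) annotations; the function name, polygon ordering, and target height are illustrative choices rather than details from our pipeline:

```python
import cv2
import numpy as np

def crop_text_instance(image, quad, target_h=32):
    """Crop one text instance given a 4-point polygon (clockwise from
    top-left), rectifying it with a perspective transform."""
    quad = np.asarray(quad, dtype=np.float32).reshape(4, 2)
    # Estimate the rectified patch size from the polygon's edge lengths.
    w = int(round(max(np.linalg.norm(quad[0] - quad[1]),
                      np.linalg.norm(quad[3] - quad[2]))))
    h = int(round(max(np.linalg.norm(quad[0] - quad[3]),
                      np.linalg.norm(quad[1] - quad[2]))))
    dst = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]],
                   dtype=np.float32)
    M = cv2.getPerspectiveTransform(quad, dst)
    patch = cv2.warpPerspective(image, M, (w, h))
    # Resize to a fixed height, keeping aspect ratio, for the recognizer.
    scale = target_h / max(h, 1)
    return cv2.resize(patch, (max(int(w * scale), 1), target_h))
```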

Inference stage
1. We crop the detection boxes from the test data;
2. We recognize the cropped texts with the trained PARSeq model;
3. We merge the recognition results into the detection results (see the pipeline sketch after this list).
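
The three steps compose into a simple per-image loop. The sketch below reuses the crop_text_instance helper from above; the detector and recognizer wrappers and their predict interfaces are assumptions made for illustration:

```python
def end_to_end_inference(image, detector, recognizer):
    """Sketch of the inference pipeline: detect boxes, crop each one,
    recognize, and attach the transcription to its box. `detector` and
    `recognizer` stand in for the trained DBNet and PARSeq models."""
    results = []
    for quad in detector.predict(image):          # step 1: detection boxes
        patch = crop_text_instance(image, quad)   # crop as in training
        text, conf = recognizer.predict(patch)    # step 2: PARSeq recognition
        results.append({"polygon": quad, "text": text, "score": conf})
    return results                                # step 3: merged results
```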

A critical technical trick:
To keep the data domain consistent between the training and inference stages, we run our DBNet detector on the training data and replace the annotated boxes with the detected boxes. This step does not impair the fidelity of the data, while adapting the training domain to the inference domain, and it substantially improves our model's performance. A sketch of this relabeling step follows.
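
A minimal sketch of the relabeling, assuming IoU-based matching between detected and annotated polygons; the use of shapely, the 0.5 threshold, and the data structures are assumptions made for illustration:

```python
from shapely.geometry import Polygon

def relabel_with_detections(gt_instances, det_quads, iou_thresh=0.5):
    """Match each detected box to an annotated one by polygon IoU and
    inherit its transcription, so recognition training crops come from
    detector outputs rather than hand-drawn annotations.
    gt_instances: list of (quad, text) pairs for one training image;
    det_quads: detector polygons on the same image."""
    relabeled = []
    for det in det_quads:
        det_poly = Polygon(det)
        best_iou, best_text = 0.0, None
        for gt_quad, text in gt_instances:
            gt_poly = Polygon(gt_quad)
            inter = det_poly.intersection(gt_poly).area
            union = det_poly.union(gt_poly).area
            iou = inter / union if union > 0 else 0.0
            if iou > best_iou:
                best_iou, best_text = iou, text
        if best_iou >= iou_thresh:
            # Keep the detected geometry, but reuse the annotated label.
            relabeled.append((det, best_text))
    return relabeled
```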