Method: sudokill-9 - Task 1 - End-to-End Text Recognition - Out of Vocabulary Scene Text Understanding

method: sudokill-9 2022-07-21

Authors: sudokill-9

Description: 1. Text detection:
The method is based on DBNet++. We use multi-scale inputs with random-crop sizes from 320 to 720, and the data set contains about 260,000 cleaned samples.
2. Text recognition:
A transformer-based framework. The encoder consists of 12 ViT-based blocks. The decoder fuses the results of three types of decoder: a CTC-based decoder, an attention-based decoder, and a CTC+attention-based decoder. Confidence bootstrap is used in post-processing.
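
The description does not say how the outputs of the three decoders are fused, so the following is only an illustrative sketch of one plausible rule, not the submitted implementation. It assumes each decoder returns a transcription with a confidence in [0, 1], groups identical transcriptions, and keeps the string with the largest summed confidence. The function name `fuse_hypotheses`, the decoder labels, and the example scores are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A hypothesis is (transcription, confidence in [0, 1]).
# The decoder names used as dictionary keys are labels only (assumed).
Hypothesis = Tuple[str, float]


def fuse_hypotheses(hypotheses: Dict[str, Hypothesis]) -> Hypothesis:
    """Pick one transcription from several decoder outputs.

    Assumed rule (not taken from the method description): decoders that
    agree on the same string pool their confidences, the string with the
    largest pooled confidence wins, and the returned confidence is the
    mean over the decoders that produced it.
    """
    if not hypotheses:
        raise ValueError("at least one decoder hypothesis is required")

    # Group confidences by the transcription they support.
    support: Dict[str, List[float]] = defaultdict(list)
    for text, confidence in hypotheses.values():
        support[text].append(confidence)

    # Ties resolve to the transcription seen first.
    best_text = max(support, key=lambda text: sum(support[text]))
    scores = support[best_text]
    return best_text, sum(scores) / len(scores)


# Hypothetical outputs for one cropped word image.
example = {
    "ctc": ("sudoku", 0.62),
    "attention": ("sudoko", 0.91),
    "ctc_attention": ("sudoku", 0.58),
}
fused_text, fused_conf = fuse_hypotheses(example)
```

In this example two decoders agree on "sudoku", so their pooled confidence outweighs the single higher-confidence "sudoko". A simple argmax over individual confidences would be an equally valid alternative under different assumptions.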