Authors: Wenhan Xian, Kaiyu Zhang, Xuewen Yang, Yuan Lin
Description: Our detection model detects text areas by exploring individual characters as well as the affinity between characters. It uses ResNet-50 as the feature extraction backbone, followed by a U-Net structure that constructs the output feature maps of the text regions and the affinity regions. We use weakly supervised learning to estimate the approximate locations of the characters and the affinities. The model can also detect arbitrarily shaped text. For recognition, we use ResNet-50 as the feature extraction backbone, Transformer encoding layers as the encoder, and Position Attention as the decoder. The recognition model is trained on large public multi-language datasets to achieve good performance. To make the model adaptive to out-of-vocabulary data, we focus only on the vision task and do not include any semantic understanding models in our methods.
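
The description does not say how the text-region and affinity maps are turned into text instances. The sketch below shows one plausible way, offered as an illustration and not as the authors' actual post-processing: threshold both maps, take their union, and group connected pixels so that characters joined by affinity end up in the same instance. The function name `group_text_regions`, the thresholds of 0.5, and the toy score maps are all assumptions made for this example. It returns axis-aligned boxes for simplicity, which would not be enough for arbitrarily shaped text.

```python
from collections import deque


def group_text_regions(region, affinity, region_thr=0.5, affinity_thr=0.5):
    """Group character and affinity score maps into text-instance boxes.

    region, affinity: equally sized 2D lists of scores in [0, 1].
    Returns inclusive (top, left, bottom, right) boxes in scan order.
    Thresholds are illustrative assumptions, not values from the source.
    """
    h, w = len(region), len(region[0])
    # A pixel counts as text if it lies on a character or on a link
    # between two characters.
    mask = [[region[y][x] >= region_thr or affinity[y][x] >= affinity_thr
             for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            has_char = False
            while queue:
                cy, cx = queue.popleft()
                has_char = has_char or region[cy][cx] >= region_thr
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Discard components made of affinity only (no character).
            if has_char:
                boxes.append((top, left, bottom, right))
    return boxes


# Toy maps: three characters on one row; affinity links only the first two.
region_map = [
    [0.0] * 9,
    [0.9, 0.9, 0.0, 0.8, 0.8, 0.0, 0.0, 0.9, 0.9],
    [0.0] * 9,
]
affinity_map = [
    [0.0] * 9,
    [0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0] * 9,
]
boxes = group_text_regions(region_map, affinity_map)
print(boxes)  # [(1, 0, 1, 4), (1, 7, 1, 8)]
```

In the toy input, the first two characters are merged into one instance because an affinity pixel bridges them, while the third character stays separate. With an all-zero affinity map, the same call would return three single-character boxes.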