method: Upstage KR2023-04-01

Authors: Dahyun Kim, Yunsu Kim, Seung Shin, Bibek Chaudhary, Sanghoon Kim, Sehwan Joo

Affiliation: Upstage

Description: For Task 2, we use a cascade approach in which the pipeline is split into 1) text detection and 2) text recognition. For text detection, we reuse our Task 1 methodology. For text recognition, we use the ParSeq [1] architecture with its visual feature extractor replaced by SwinV2 [2].
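The two-stage cascade above can be sketched as follows. This is a minimal illustrative sketch, not the submission's actual code: the `detect` and `recognize` callables stand in for the Task 1 detector and the SwinV2-ParSeq recognizer, and the box format `(x0, y0, x1, y1)` is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) word bounding box

@dataclass
class WordResult:
    box: Box
    text: str

def cascade_ocr(
    image,                                   # full scene image (rows of pixels)
    detect: Callable[[object], List[Box]],   # stage 1: image -> word boxes
    recognize: Callable[[object], str],      # stage 2: cropped word image -> text
) -> List[WordResult]:
    """Run the cascade: detect word boxes, then recognize each cropped region."""
    results = []
    for (x0, y0, x1, y1) in detect(image):
        # crop the detected word region from the full image
        crop = [row[x0:x1] for row in image[y0:y1]]
        results.append(WordResult(box=(x0, y0, x1, y1), text=recognize(crop)))
    return results
```

In practice `detect` and `recognize` would be trained models; here they are placeholders to show how the two stages compose.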
We pretrain the text recognizer on synthetic data before fine-tuning it on the HierText dataset. We use an in-house synthetic data generator, derived from the open-source SynthTiger [3], to generate word images from English and Korean corpora: 10M English/Korean word images with horizontal layout and 5M with vertical layout. For the final submission, we use an ensemble of three text recognizers for strong and stable performance.
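The description does not specify how the three recognizers' outputs are combined; one simple instantiation is a majority vote over the predicted strings, sketched below under that assumption (ties broken by model order).

```python
from collections import Counter
from typing import Callable, List

def ensemble_recognize(crop, recognizers: List[Callable[[object], str]]) -> str:
    """Combine word predictions from several recognizers by majority vote.

    Assumed combination rule: the string predicted most often wins; on a tie,
    the prediction from the earliest model in the list is kept.
    """
    preds = [r(crop) for r in recognizers]
    counts = Counter(preds)
    best = max(counts.values())
    # keep list order so ties fall back to the first recognizer's output
    return next(p for p in preds if counts[p] == best)
```

A confidence-weighted vote (averaging per-model scores before picking the argmax) would be a natural alternative when the recognizers expose calibrated scores.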

[1] Bautista, D., & Atienza, R. (2022). Scene text recognition with permuted autoregressive sequence models. In ECCV 2022.
[2] Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., ... & Guo, B. (2022). Swin Transformer V2: Scaling up capacity and resolution. In CVPR 2022.
[3] Yim, M., Kim, Y., Cho, H. C., & Park, S. (2021). SynthTIGER: Synthetic text image generator towards better text recognition models. In ICDAR 2021.