Method: Upstage KR

Date: 2023-03-30

Authors: Dahyun Kim, Yunsu Kim, Seung Shin, Bibek Chaudhary, Sanghoon Kim, Sehwan Joo

Affiliation: Upstage

Description: For Task 2, we use a cascade approach that splits the pipeline into two stages: 1) text detection and 2) text recognition. For text detection, we reuse our Task 1 methodology. For text recognition, we use the PARSeq [1] architecture with its visual feature extractor replaced by SwinV2 [2].
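
The two-stage cascade described above can be sketched roughly as follows. This is only an illustration of the control flow, not the submitted system: the `detect` and `recognize` callables, the axis-aligned box format, and the `SpottedWord` type are all hypothetical stand-ins for the Task 1 detector and the PARSeq/SwinV2 recognizer, and the toy stubs exist only so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical formats, assumed for illustration only.
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), axis-aligned, end-exclusive
Image = Sequence[Sequence[int]]   # rows of pixel values; stand-in for a real image array


@dataclass
class SpottedWord:
    box: Box
    text: str


def crop(image: Image, box: Box) -> List[List[int]]:
    """Cut the region covered by `box` out of the image."""
    x0, y0, x1, y1 = box
    return [list(row[x0:x1]) for row in image[y0:y1]]


def cascade_spotting(
    image: Image,
    detect: Callable[[Image], List[Box]],
    recognize: Callable[[List[List[int]]], str],
) -> List[SpottedWord]:
    """Stage 1 proposes text boxes; stage 2 reads each cropped box independently."""
    results = []
    for box in detect(image):
        patch = crop(image, box)
        results.append(SpottedWord(box=box, text=recognize(patch)))
    return results


# Toy stand-ins so the sketch is runnable; real models would replace these.
def toy_detect(image: Image) -> List[Box]:
    return [(0, 0, 2, 1), (2, 0, 4, 1)]


def toy_recognize(patch: List[List[int]]) -> str:
    # Pretend each pixel value in the first row is a character code.
    return "".join(chr(v) for v in patch[0])


image = [[72, 105, 79, 75]]
words = cascade_spotting(image, toy_detect, toy_recognize)
```

Because the recognizer only ever sees cropped regions, the two stages stay decoupled in a design like this, so either one can be swapped or retrained without touching the other.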

[1] Bautista, D., & Atienza, R. (2022). Scene text recognition with permuted autoregressive sequence models. In ECCV.
[2] Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., ... & Guo, B. (2022). Swin Transformer V2: Scaling up capacity and resolution. In CVPR.