method: e2e text spotter - final version2022-07-21

Authors: Taeho Kil, Seonghyeon Kim, Sukmin Seo

Affiliation: Clova AI OCR Team, NAVER/LINE Corp.

Description: An end-to-end scene text spotter based on CNN backbone, deformable transformer encoder, location decoder and text decoder. The location decoder based on the segmentation method (Differentiable Binarization) detects text regions, and text decoder based on the deformable transformer decoder recognizes each instances from image features and detected location information. We use not multiple ensemble models but a single model, and all sub-modules are end-to-end trainable. We use real datasets provided by this challenge (train + val split), and synthetic dataset. Since cocotext dataset has a lots of label noises (with regards to alphabet capitalization), we refined the cocotext dataset annotation using teacher model (trained without cocotext).