method: Clova DEER

Date: 2023-04-01

Authors: Song Kayeon, Taeho Kil, Donghyun Kim, Sukmin Seo

Affiliation: Naver Cloud

Description: Our model passes the input image through a CNN and a deformable transformer encoder to extract multi-scale visual features. An independent segmentation head then extracts words, lines, and paragraphs, and a deformable transformer decoder produces the text recognition results. In summary, a single model performs layout detection (task 1) and OCR (task 2) simultaneously.
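The data flow described above can be sketched as follows. This is only a structural illustration with placeholder stages, not the authors' implementation: the function names (encode, segment, recognize, run), the strides, and the dummy boxes and strings are all assumptions made for the example. It shows one set of multi-scale features being computed once and consumed by two separate heads. How recognized text is matched to word regions is not stated in the description and is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # hypothetical region format: (x0, y0, x1, y1)


@dataclass
class Features:
    # One entry per scale: (stride, feature_height, feature_width).
    scales: List[Tuple[int, int, int]]


@dataclass
class Layout:
    words: List[Box]
    lines: List[Box]
    paragraphs: List[Box]


def encode(height: int, width: int, strides=(8, 16, 32)) -> Features:
    """Placeholder for the CNN + deformable transformer encoder.

    Only the shapes of the multi-scale feature maps are modelled here;
    the strides are assumed values, not taken from the description.
    """
    return Features([(s, height // s, width // s) for s in strides])


def segment(features: Features) -> Layout:
    """Placeholder for the segmentation head (task 1): returns dummy regions."""
    word_a, word_b = (0, 0, 40, 16), (48, 0, 96, 16)
    line = (0, 0, 96, 16)
    return Layout(words=[word_a, word_b], lines=[line], paragraphs=[line])


def recognize(features: Features) -> List[str]:
    """Placeholder for the deformable transformer decoder (task 2)."""
    return ["<text>", "<text>"]  # dummy transcriptions


def run(height: int, width: int) -> Tuple[Layout, List[str]]:
    features = encode(height, width)  # shared features, computed once
    layout = segment(features)        # layout detection
    texts = recognize(features)       # OCR, from the same features
    return layout, texts


if __name__ == "__main__":
    layout, texts = run(256, 512)
    print(len(layout.words), len(layout.lines), len(layout.paragraphs), texts)
```

The point of the sketch is the sharing: both heads read the same encoder output, which is what lets a single forward pass serve both tasks.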