Authors: Yunsu Kim, Seung Shin, Bibek Chaudhary, Sanghoon Kim, Dahyun Kim, Sehwan Joo
Description: We address hierarchical text detection with a two-step approach. First, we perform multi-class semantic segmentation, where the classes are word, line, and paragraph regions. Then, we use the predicted probability map to extract these entities and organize them hierarchically. Specifically, we use an ensemble of U-Nets with ImageNet-pretrained EfficientNetB7/MitB4 backbones to extract the class masks. We identify connected components in each predicted mask to separate individual words from each other, and likewise individual lines and paragraphs. Then, word_i is assigned as a child of line_j if line_j has the highest IoU with word_i among all lines. Lines are assigned to paragraphs in the same way.
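The grouping step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (label_components, iou, assign_parents), the 4-connectivity, the pixel-set representation, and the choice to return None for a child with no overlapping parent are all assumptions made for the example. In practice a library routine such as scipy.ndimage.label or cv2.connectedComponents would replace the hand-written labeling.

```python
from collections import deque


def label_components(mask):
    """Split a binary mask (list of rows of 0/1) into 4-connected components.

    Returns a list of components in scan order, each a set of (row, col) pixels.
    """
    h = len(mask)
    w = len(mask[0]) if mask else 0
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = set()
            while queue:
                y, x = queue.popleft()
                pixels.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def assign_parents(children, parents):
    """For each child, return the index of the parent with the highest IoU.

    Assumption for this sketch: a child overlapping no parent gets None.
    """
    result = []
    for child in children:
        best, best_iou = None, 0.0
        for j, parent in enumerate(parents):
            score = iou(child, parent)
            if score > best_iou:
                best, best_iou = j, score
        result.append(best)
    return result


# Toy binarized masks: three words lying on two lines.
word_mask = [
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
]
line_mask = [
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
]

words = label_components(word_mask)
lines = label_components(line_mask)
word_to_line = assign_parents(words, lines)
print(word_to_line)  # [0, 0, 1]
```

The same assign_parents call, applied to line components and paragraph components, would give the line-to-paragraph level of the hierarchy.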
For training, we erode target entities and dilate predicted entities. We also ensure that target entities maintain a gap between them. We use the symmetric Lovász loss and pretrain our models on the SynthText dataset.
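The erosion and dilation mentioned above can be illustrated with a small sketch. The structuring element (a 5-pixel cross), the single iteration, and the treatment of out-of-image pixels as background are assumptions chosen for the example; the description does not state the kernel or iteration count actually used. The idea it shows is that shrinking each target keeps neighbouring entities from merging into one connected component, while dilating a prediction grows it back toward the original extent.

```python
# Cross-shaped neighbourhood: the pixel itself plus its 4 direct neighbours.
# Assumption for this sketch; the real kernel is not given in the description.
NEIGHBORHOOD = ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1))


def _morph(mask, combine):
    """Apply `combine` (all -> erosion, any -> dilation) over the neighbourhood."""
    h, w = len(mask), len(mask[0])

    def at(y, x):
        # Pixels outside the image count as background.
        return mask[y][x] if 0 <= y < h and 0 <= x < w else 0

    return [
        [1 if combine(at(y + dy, x + dx) for dy, dx in NEIGHBORHOOD) else 0
         for x in range(w)]
        for y in range(h)
    ]


def erode(mask):
    """Keep a pixel only if its whole neighbourhood is foreground."""
    return _morph(mask, all)


def dilate(mask):
    """Set a pixel if any pixel in its neighbourhood is foreground."""
    return _morph(mask, any)


# A 3x3 target entity inside a 5x5 image.
target = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

eroded = erode(target)     # only the centre pixel survives
restored = dilate(eroded)  # grows back into a cross around the centre

for row in eroded:
    print(row)
```

Note that with this cross-shaped element, dilation does not exactly invert erosion (the corners of the square are lost), so the kernel choice affects how closely the recovered entities match the original regions.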