Authors: Qingwen Bu, Yichuan Cheng, Minbin Huang
Affiliation: DeepSE X Upstage HK
Description: We use DBNet as the base scene text detector, with an oCLIP-pretrained Swin Transformer-Base model as the backbone, predicting directly at three different levels. Following DBNet, we employ balanced cross-entropy loss for the binary map and L1 loss for the threshold map. We then fine-tune the model with Lovász loss for finer localization.
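
The loss terms named above can be illustrated with a minimal pure-Python sketch over flattened pixel lists. This is not the submission's training code. It assumes the conventions of the original DBNet paper (hard negative mining at a 1:3 positive-to-negative ratio, and an L1 term restricted to a mask) and, for the fine-tuning stage, the binary hinge variant of the Lovász loss. The function names, `neg_ratio`, and `mask` are hypothetical and chosen for illustration only.

```python
import math

def balanced_bce(pred, gt, neg_ratio=3.0, eps=1e-6):
    """Balanced cross-entropy on flat lists of probabilities and 0/1 labels.

    Keeps every positive pixel and only the hardest negatives, capped at
    neg_ratio times the number of positives (assumed DBNet convention).
    """
    pos_losses, neg_losses = [], []
    for p, g in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if g == 1:
            pos_losses.append(-math.log(p))
        else:
            neg_losses.append(-math.log(1.0 - p))
    k = min(len(neg_losses), int(len(pos_losses) * neg_ratio))
    hard_neg = sorted(neg_losses, reverse=True)[:k]
    return (sum(pos_losses) + sum(hard_neg)) / (len(pos_losses) + k + eps)

def masked_l1(pred, gt, mask, eps=1e-6):
    """Mean absolute error over the pixels where mask == 1."""
    total = sum(abs(p - g) * m for p, g, m in zip(pred, gt, mask))
    return total / (sum(mask) + eps)

def lovasz_hinge(logits, labels):
    """Binary Lovász hinge loss on flat lists of logits and 0/1 labels.

    Assumption: the hinge variant is used; the description does not say which.
    """
    if not logits:
        return 0.0
    # Hinge errors: label 1 maps to sign +1, label 0 maps to sign -1.
    errors = [1.0 - x * (2 * y - 1) for x, y in zip(logits, labels)]
    order = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    total_pos = sum(labels)
    loss, cum_pos, cum_neg, prev_jaccard = 0.0, 0, 0, 0.0
    for i in order:
        cum_pos += labels[i]
        cum_neg += 1 - labels[i]
        # Jaccard loss if the pixels seen so far were all misclassified.
        jaccard = 1.0 - (total_pos - cum_pos) / (total_pos + cum_neg)
        loss += max(errors[i], 0.0) * (jaccard - prev_jaccard)
        prev_jaccard = jaccard
    return loss
```

In a real training loop these would operate on tensors in a framework such as PyTorch, and the terms would be combined with weights that this description does not specify.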