Method: TH-DL-v2 (2019-06-04)

Authors: Ruijie Yan, Linhui Chen, Liangrui Peng, Tsinghua University, Beijing, China

Description: We propose a multi-task learning method that handles script identification and text recognition jointly. A CNN-LSTM network extracts features for both tasks: a ResNet34 performs spatial feature extraction, and a 3-layer bidirectional LSTM with 512 units per layer and per direction performs sequence modeling. For script identification, a fully connected layer is added as the classifier. For text recognition, a CTC layer is added for decoding. The two tasks are trained jointly on a weighted sum of their loss functions, with a weight of 0.5 for the script identification loss and 1 for the text recognition loss.
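The weighted joint loss described above can be sketched in plain Python. This is an illustration only, not the authors' implementation: the function names, the blank index 0, and the use of per-frame probabilities as input are assumptions, and the ResNet34 and LSTM layers that would produce those inputs are not modeled here. Only the two weights (0.5 and 1) come from the description.

```python
import math

SCRIPT_WEIGHT = 0.5  # weight of the script identification loss (from the description)
TEXT_WEIGHT = 1.0    # weight of the text recognition loss (from the description)
BLANK = 0            # assumed index of the CTC blank symbol


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def script_loss(logits, target):
    """Cross-entropy of the script classifier output for one sample."""
    return -math.log(softmax(logits)[target])


def ctc_loss(frame_probs, labels):
    """Negative log-likelihood of `labels` under CTC (forward algorithm).

    frame_probs[t][k] is the probability of symbol k at time step t.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [BLANK]
    for lab in labels:
        ext += [lab, BLANK]

    # A path may start with the leading blank or with the first label.
    alpha = [0.0] * len(ext)
    alpha[0] = frame_probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = frame_probs[0][ext[1]]

    for probs in frame_probs[1:]:
        prev = alpha
        alpha = [0.0] * len(ext)
        for s, sym in enumerate(ext):
            total = prev[s]
            if s >= 1:
                total += prev[s - 1]
            # Skipping the blank is allowed only between different labels.
            if s >= 2 and sym != BLANK and sym != ext[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * probs[sym]

    # A valid path ends on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)
    return -math.log(likelihood)


def joint_loss(script_logits, script_target, frame_probs, labels):
    """Weighted sum of the two task losses."""
    return (SCRIPT_WEIGHT * script_loss(script_logits, script_target)
            + TEXT_WEIGHT * ctc_loss(frame_probs, labels))
```

In a real training setup both terms would be averaged over a batch and computed in log space by a deep learning framework. The sketch works on a single sample with raw probabilities to keep the arithmetic visible.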

Confusion Matrix

Detection
Arabic Latin Chinese Japanese Korean Bangla Hindi Symbols None
GT
Arabic 46993526372122230
Latin 82595726835928539232090
Chinese 539733339021006520
Japanese 39188478050763191219280
Korean 321482192474107412532140
Bangla 21725243022496120
Hindi 47004217412610
Symbols 18784432300931380
None 0 0 0 0 0 0 0 0 0
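Assuming rows are ground-truth (GT) scripts and columns are detected scripts, as the axis labels indicate, per-script recall and overall accuracy can be read off a confusion matrix as in the sketch below. The function names are illustrative, and the toy counts are hypothetical values chosen for the example, not the figures reported above.

```python
def per_class_recall(matrix, labels):
    """Recall per ground-truth class: diagonal count divided by the row sum."""
    recall = {}
    for i, name in enumerate(labels):
        row_total = sum(matrix[i])
        # A class with no ground-truth samples has no defined recall.
        recall[name] = matrix[i][i] / row_total if row_total else None
    return recall


def overall_accuracy(matrix):
    """Share of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else None


# Hypothetical toy counts (rows: GT, columns: detection), NOT the reported values.
toy_labels = ["Arabic", "Latin", "None"]
toy_matrix = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 0],
]

toy_recall = per_class_recall(toy_matrix, toy_labels)
toy_accuracy = overall_accuracy(toy_matrix)
```

An all-zero row, such as the None row, yields an undefined recall, which the sketch reports as `None` instead of dividing by zero.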