method: TH-DL-v1 2019-06-03

Authors: Ruijie Yan, Linhui Chen, Liangrui Peng, Tsinghua University, Beijing, China

Description: We propose a multi-task learning method that handles script identification and text recognition together. A CNN-LSTM network extracts features for both tasks: a ResNet34 performs spatial feature extraction, and a 3-layer bidirectional LSTM with 512 units per layer and per direction performs sequence modeling. For script identification, a fully connected layer is added as the classifier. For text recognition, a CTC layer is added for decoding. The two tasks are trained jointly on a weighted sum of their loss functions, with a weight of 0.5 for the script identification loss and a weight of 1 for the text recognition loss.
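The weighted sum of the two losses can be illustrated with a minimal Python sketch. Only the weights 0.5 and 1 come from the description. The use of softmax cross-entropy for script identification, the function names, and all numeric inputs are assumptions made for illustration, and the CTC loss is represented by a placeholder value rather than computed.

```python
import math

# Loss weights stated in the description.
SCRIPT_ID_WEIGHT = 0.5
TEXT_RECOGNITION_WEIGHT = 1.0


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample.

    Assumed stand-in for the script identification loss; the description
    does not name the loss used on the fully connected layer.
    """
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def joint_loss(script_loss, recognition_loss):
    """Weighted sum of the two task losses."""
    return SCRIPT_ID_WEIGHT * script_loss + TEXT_RECOGNITION_WEIGHT * recognition_loss


# Hypothetical values for a single training sample.
script_logits = [2.0, 0.5, -1.0]  # scores for three example script classes
script_loss = cross_entropy(script_logits, target_index=0)
ctc_loss = 3.2                    # placeholder for a CTC loss computed elsewhere
total = joint_loss(script_loss, ctc_loss)
print(f"joint loss: {total:.4f}")
```

Because the script identification term is weighted at half the recognition term, errors in the recognition branch have twice the weight in the joint objective.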

Confusion Matrix

Detection (columns): Arabic | Latin | Chinese | Japanese | Korean | Bangla | Hindi | Symbols | None
GT (rows):
Arabic 47023655262151170
Latin 67599207222515833201420
Chinese 84243597641716210
Japanese 392193108545362441825170
Korean 241873251379103774135120
Bangla 31678161722736010
Hindi 57303217412310
Symbols 18972627251629600
None 0 0 0 0 0 0 0 0 0