method: TH-DL (2019-05-27)
Authors: Ruijie Yan, Linhui Chen, Liangrui Peng, Tsinghua University, Beijing, China
Description: We propose a multi-task learning method that handles script identification and text recognition jointly. A CNN-LSTM network extracts features shared by both tasks: a ResNet34 performs spatial feature extraction, and a 3-layer bidirectional LSTM with 512 units per layer in each direction performs sequence modeling. For script identification, a fully connected layer is added as the classification head. For text recognition, a CTC layer is added for decoding. The two tasks are trained jointly on a weighted sum of their two loss functions, with a weight of 0.5 for the script identification loss and a weight of 1 for the text recognition loss.
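The weighted sum described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' code: the names `cross_entropy` and `joint_loss` are hypothetical, softmax cross-entropy is an assumed choice for the script identification loss (the description only states that a fully connected layer is used), and the CTC loss is treated as an already computed number.

```python
import math

# Loss weights as stated in the description.
W_RECOGNITION = 1.0
W_SCRIPT = 0.5


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample (assumed script-ID loss)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def joint_loss(recognition_loss, script_loss,
               w_recognition=W_RECOGNITION, w_script=W_SCRIPT):
    """Weighted sum of the two task losses used for joint training."""
    return w_recognition * recognition_loss + w_script * script_loss


# Hypothetical values: a precomputed CTC loss and 8-way script logits.
ctc_loss_value = 2.0
script_logits = [0.0] * 8
script_loss_value = cross_entropy(script_logits, target_index=1)
total_loss = joint_loss(ctc_loss_value, script_loss_value)
```

With uniform logits over 8 classes, the script loss equals `log(8)`, so `total_loss` is `2.0 + 0.5 * log(8)`. In a real training loop both terms would come from the network heads rather than from constants.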
Confusion Matrix
Rows are ground truth (GT) scripts; columns are detected scripts.

GT \ Detection | Arabic | Latin | Chinese | Japanese | Korean | Bangla | Hindi | Symbols | None
---|---|---|---|---|---|---|---|---|---
Arabic | 4686 | 323 | 3 | 44 | 32 | 3 | 2 | 49 | 0
Latin | 154 | 58106 | 53 | 975 | 627 | 64 | 18 | 640 | 0
Chinese | 3 | 300 | 2681 | 1658 | 66 | 3 | 1 | 38 | 0
Japanese | 21 | 1722 | 494 | 5486 | 282 | 12 | 4 | 136 | 0
Korean | 35 | 1701 | 209 | 1355 | 9571 | 24 | 19 | 78 | 0
Bangla | 12 | 170 | 5 | 53 | 15 | 2248 | 39 | 3 | 0
Hindi | 5 | 57 | 1 | 5 | 2 | 47 | 4096 | 11 | 0
Symbols | 18 | 574 | 0 | 16 | 14 | 8 | 3 | 3382 | 0
None | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
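
As a reading aid, the confusion matrix can be reduced to an overall accuracy and a per-script recall. The sketch below copies the counts from the table, with rows as ground truth and columns as detected script. The variable and function names are illustrative, and the metric definitions are the standard ones rather than anything specified by this page.

```python
LABELS = ["Arabic", "Latin", "Chinese", "Japanese", "Korean",
          "Bangla", "Hindi", "Symbols", "None"]

# Rows: ground truth script. Columns: detected script (same order as LABELS).
MATRIX = [
    [4686, 323, 3, 44, 32, 3, 2, 49, 0],
    [154, 58106, 53, 975, 627, 64, 18, 640, 0],
    [3, 300, 2681, 1658, 66, 3, 1, 38, 0],
    [21, 1722, 494, 5486, 282, 12, 4, 136, 0],
    [35, 1701, 209, 1355, 9571, 24, 19, 78, 0],
    [12, 170, 5, 53, 15, 2248, 39, 3, 0],
    [5, 57, 1, 5, 2, 47, 4096, 11, 0],
    [18, 574, 0, 16, 14, 8, 3, 3382, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal (correctly identified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_class_recall(matrix, labels):
    """Recall per ground-truth class; None when a class has no samples."""
    recall = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recall[label] = matrix[i][i] / row_total if row_total else None
    return recall


accuracy = overall_accuracy(MATRIX)
recall = per_class_recall(MATRIX, LABELS)

for label in LABELS:
    value = recall[label]
    print(f"{label:8s} {'n/a' if value is None else f'{value:.4f}'}")
print(f"overall  {accuracy:.4f}")
```

The largest off-diagonal counts sit between Chinese and Japanese (1658 Chinese samples detected as Japanese, 494 the other way), and Chinese accordingly shows the lowest recall in this summary.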