Method: Tencent-DPPR Team (2019-06-04)
Authors: Sicong Liu, Haoxi Li, Haibo Qin, Ben Xu, Chunchao Guo, Longhuang Wu, Shangxuan Tian, Hongfa Wang, Hongkai Chen, Qinglin Lu, Chun Yang, Xucheng Yin, Lei Xiao
Description: We are the Tencent-DPPR (Data Platform Precision Recommendation) team. We first recognize text lines and their character-level language types using the ensembled results of several recognition models, which are based on CTC/Seq2Seq and CNN with self-attention/RNN. We then identify the language type of each recognized result based on statistics from the MLT-2019 and Wikipedia corpora.
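
The step from character-level language types to a line-level label can be illustrated with a simple majority vote. The sketch below is not the team's method: it replaces the corpus statistics mentioned in the description with Unicode character names from Python's standard library, and the function names `char_script` and `line_script` as well as the kana tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter
import unicodedata


def char_script(ch):
    """Map one character to a coarse MLT-2019 label via its Unicode name.

    Hypothetical helper: the real system uses recognition-model outputs
    and corpus statistics, not Unicode names.
    """
    if ch.isspace():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("ARABIC"):
        return "Arabic"
    if name.startswith("BENGALI"):
        return "Bangla"
    if name.startswith("DEVANAGARI"):
        return "Hindi"
    if name.startswith("HANGUL"):
        return "Korean"
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "Japanese"
    if name.startswith("CJK"):
        return "Chinese"
    if name.startswith("LATIN"):
        return "Latin"
    # Digits, punctuation, and anything unrecognized fall back to Symbols.
    return "Symbols"


def line_script(text):
    """Pick a line-level label by majority vote over character labels."""
    votes = Counter(s for s in map(char_script, text) if s is not None)
    if not votes:
        return "None"
    # Assumed heuristic: any kana means the ideographs are Japanese kanji.
    if votes["Japanese"] > 0:
        votes["Japanese"] += votes.pop("Chinese", 0)
    # Symbols only win when no script characters are present.
    letters = {s: n for s, n in votes.items() if s != "Symbols"}
    if not letters:
        return "Symbols"
    return max(letters, key=letters.get)


if __name__ == "__main__":
    for sample in ["Hello 123", "東京タワー", "中文文本", "한국어", "123!"]:
        print(repr(sample), "->", line_script(sample))
```

A vote based on Unicode names cannot separate Japanese text written only in kanji from Chinese, which is one reason a real system would rely on corpus statistics instead.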
Confusion Matrix
Rows are ground-truth (GT) script classes; columns are detected script classes.

| GT / Detection | Arabic | Latin | Chinese | Japanese | Korean | Bangla | Hindi | Symbols | None |
|---|---|---|---|---|---|---|---|---|---|
| Arabic | 5003 | 102 | 7 | 11 | 3 | 3 | 2 | 11 | 0 |
| Latin | 224 | 59413 | 278 | 199 | 127 | 47 | 38 | 311 | 0 |
| Chinese | 5 | 30 | 4404 | 288 | 6 | 5 | 7 | 5 | 0 |
| Japanese | 79 | 769 | 1049 | 6075 | 72 | 14 | 47 | 52 | 0 |
| Korean | 114 | 1026 | 299 | 152 | 11239 | 46 | 84 | 32 | 0 |
| Bangla | 11 | 44 | 7 | 4 | 6 | 2442 | 29 | 2 | 0 |
| Hindi | 6 | 29 | 0 | 1 | 0 | 8 | 4178 | 2 | 0 |
| Symbols | 23 | 309 | 34 | 39 | 10 | 2 | 2 | 3596 | 0 |
| None | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
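
Per-class recall and precision can be derived from the confusion matrix above. The following is a minimal sketch, assuming rows are ground truth and columns are detections. The all-zero None row and column are omitted, and the variable names are illustrative rather than part of the competition tooling.

```python
LABELS = ["Arabic", "Latin", "Chinese", "Japanese",
          "Korean", "Bangla", "Hindi", "Symbols"]

# Rows: ground truth. Columns: detected class. Values copied from the table.
MATRIX = [
    [5003, 102, 7, 11, 3, 3, 2, 11],
    [224, 59413, 278, 199, 127, 47, 38, 311],
    [5, 30, 4404, 288, 6, 5, 7, 5],
    [79, 769, 1049, 6075, 72, 14, 47, 52],
    [114, 1026, 299, 152, 11239, 46, 84, 32],
    [11, 44, 7, 4, 6, 2442, 29, 2],
    [6, 29, 0, 1, 0, 8, 4178, 2],
    [23, 309, 34, 39, 10, 2, 2, 3596],
]

row_sums = [sum(row) for row in MATRIX]        # ground-truth count per class
col_sums = [sum(col) for col in zip(*MATRIX)]  # detection count per class

# Recall: share of each ground-truth class that was labelled correctly.
recall = {lab: MATRIX[i][i] / row_sums[i] for i, lab in enumerate(LABELS)}
# Precision: share of each detected class that was actually correct.
precision = {lab: MATRIX[i][i] / col_sums[i] for i, lab in enumerate(LABELS)}
# Overall accuracy: diagonal mass over all entries.
accuracy = sum(MATRIX[i][i] for i in range(len(LABELS))) / sum(row_sums)

if __name__ == "__main__":
    for lab in LABELS:
        print(f"{lab:9s} recall={recall[lab]:.3f} precision={precision[lab]:.3f}")
    print(f"overall accuracy={accuracy:.3f}")
```

Under this reading of the table, Japanese has the lowest recall of the eight classes, largely because 1049 Japanese instances are detected as Chinese.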