Method: TH-ML2019-05-28
Authors: Linhui Chen, Liangrui Peng, Tsinghua University, Beijing, China
Description: A 25-layer PyramidNet is used for feature extraction; it is based on the ResNet architecture but gradually increases the feature-map dimension at every unit instead. The feature maps are then fed into a Transformer to learn spatial dependencies; its encoder consists of 6 layers with 8-head self-attention sub-layers. Average pooling reduces the spatial dimension of the Transformer's output feature maps before the final fully connected layer, and the softmax cross-entropy loss is used for training. We propose a script-oriented data augmentation method in which the training samples of certain scripts are randomly transposed. The input images are resized to 128x256 pixels, since only a minority of them have a low aspect ratio. A text-line recognition model, composed of a basic CNN and a Transformer and trained with the CTC loss, is additionally combined to adjust low-confidence results via a voting mechanism over the predicted characters.
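The script-oriented augmentation can be sketched as follows: training images of selected scripts are randomly transposed (height and width axes swapped). This is a minimal illustration, not the authors' code; the script list, the transpose probability `p`, and the function name are assumptions.

```python
import numpy as np

def script_transpose_augment(image, script, rng, p=0.5,
                             transposable=("Chinese", "Japanese", "Korean")):
    """Randomly transpose an H x W x C training image if its script is in
    the transposable set. Which scripts to transpose and the probability
    are illustrative assumptions, not values from the method description."""
    if script in transposable and rng.random() < p:
        # Swap the height and width axes, keeping the channel axis last.
        return np.transpose(image, (1, 0, 2))
    return image
```

A transposed sample of a vertically written script then resembles a horizontal text line, which makes the fixed 128x256 resizing less distorting for such samples.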
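One plausible reading of the voting mechanism is that each character predicted by the text-line recognizer votes for the script it belongs to, and the majority script replaces a low-confidence classification. The sketch below assumes a character-to-script lookup table; the function name and tie-breaking behavior are illustrative, not taken from the source.

```python
from collections import Counter

def vote_script(pred_chars, char_to_script):
    """Majority vote over the recognizer's predicted characters.

    pred_chars     -- characters output by the text-line recognition model
    char_to_script -- assumed mapping from a character to its script label
    Returns the most frequent script, or None if no character is mapped.
    """
    votes = Counter(char_to_script[c] for c in pred_chars if c in char_to_script)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Characters absent from the mapping (e.g. shared punctuation) simply abstain, so the vote is driven by script-specific characters.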
Confusion Matrix (rows: ground truth, columns: detection)

| GT \ Detection | Arabic | Latin | Chinese | Japanese | Korean | Bangla | Hindi | Symbols | None |
|---|---|---|---|---|---|---|---|---|---|
| Arabic | 4654 | 364 | 11 | 24 | 40 | 12 | 4 | 33 | 0 |
| Latin | 193 | 59098 | 177 | 325 | 394 | 100 | 78 | 272 | 0 |
| Chinese | 19 | 419 | 3388 | 729 | 155 | 20 | 10 | 10 | 0 |
| Japanese | 94 | 2013 | 1351 | 4077 | 499 | 29 | 61 | 33 | 0 |
| Korean | 69 | 1661 | 357 | 324 | 10425 | 74 | 56 | 26 | 0 |
| Bangla | 8 | 189 | 13 | 21 | 36 | 2150 | 123 | 5 | 0 |
| Hindi | 5 | 81 | 0 | 3 | 9 | 45 | 4078 | 3 | 0 |
| Symbols | 32 | 730 | 12 | 23 | 40 | 1 | 11 | 3166 | 0 |
| None | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |