Method: TH-ML (2019-05-28)

Authors: Linhui Chen, Liangrui Peng, Tsinghua University, Beijing, China

Description: A 25-layer PyramidNet is used for feature extraction; it is based on the ResNet architecture but gradually increases the feature-map channel dimension at every residual unit instead of in discrete steps at the downsampling layers. The feature maps are then fed into a 6-layer Transformer encoder whose sub-layers use 8-head self-attention to learn spatial dependencies. Average pooling reduces the spatial dimension of the Transformer's output feature maps before the final fully connected layer, and the model is trained with softmax cross-entropy loss. We propose a script-oriented data augmentation method in which training samples of certain scripts are randomly transposed. Input images are resized to 128x256 pixels, since only a minority of them have a low aspect ratio. A text-line recognition model, composed of a basic CNN and a Transformer and trained with CTC loss, is additionally combined to adjust low-score results through a voting mechanism over the predicted characters.
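
The description does not give the PyramidNet configuration, so the following is only a minimal PyTorch sketch of the core idea: a residual unit whose identity shortcut is zero-padded so the channel count can grow additively at every unit. The widening step `widen_per_block` and the stage layout are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn

class PyramidBlock(nn.Module):
    """Pre-activation residual unit whose output is wider than its input."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        # Assumes even spatial sizes at strided units so pooling matches conv1.
        self.down = nn.AvgPool2d(2) if stride == 2 else nn.Identity()
        self.extra = out_ch - in_ch  # channels to zero-pad onto the shortcut

    def forward(self, x):
        out = self.conv1(self.bn1(x))
        out = self.conv2(self.relu(self.bn2(out)))
        out = self.bn3(out)
        shortcut = self.down(x)
        # Zero-pad the identity shortcut to match the widened residual branch.
        pad = out.new_zeros(shortcut.size(0), self.extra,
                            shortcut.size(2), shortcut.size(3))
        return out + torch.cat([shortcut, pad], dim=1)

def make_stage(in_ch: int, n_blocks: int, widen_per_block: int, first_stride: int = 2):
    """Stack units so the channel count grows linearly across the stage."""
    blocks, ch = [], in_ch
    for i in range(n_blocks):
        blocks.append(PyramidBlock(ch, ch + widen_per_block,
                                   stride=first_stride if i == 0 else 1))
        ch += widen_per_block
    return nn.Sequential(*blocks), ch
```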
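
A sketch of how the Transformer stage could sit on top of the CNN features, again assuming a PyTorch implementation: each spatial position of the feature map becomes one token, a 6-layer encoder with 8-head self-attention processes the sequence, and average pooling over positions feeds the final fully connected layer. The feature width, the class count, and the omission of positional encodings are assumptions; the description does not specify them.

```python
import torch
import torch.nn as nn

class ScriptClassifierHead(nn.Module):
    """6-layer, 8-head Transformer encoder over CNN feature maps,
    followed by average pooling and a fully connected classifier."""
    def __init__(self, feat_ch: int = 512, n_classes: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_ch, nhead=8)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.fc = nn.Linear(feat_ch, n_classes)

    def forward(self, fmap):                    # fmap: (B, C, H, W)
        seq = fmap.flatten(2).permute(2, 0, 1)  # (H*W, B, C): one token per position
        enc = self.encoder(seq)                 # self-attention over spatial positions
        return self.fc(enc.mean(dim=0))         # average pooling, then class logits

head = ScriptClassifierHead()
logits = head(torch.randn(2, 512, 4, 8))
# nn.CrossEntropyLoss applies the softmax internally, matching the
# softmax cross-entropy loss named in the description.
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1, 3]))
```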
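
The script-oriented augmentation and the fixed-size resizing can be sketched as below. The description only says "certain scripts", so the script set and the transpose probability here are placeholders.

```python
import random
from PIL import Image

# Placeholder set: the description does not name which scripts are transposed.
TRANSPOSED_SCRIPTS = {"Chinese", "Japanese", "Korean"}

def script_oriented_transpose(img: Image.Image, script: str, p: float = 0.5) -> Image.Image:
    """Randomly swap the two image axes for training samples of selected scripts."""
    if script in TRANSPOSED_SCRIPTS and random.random() < p:
        img = img.transpose(Image.TRANSPOSE)  # reflect across the main diagonal
    # All samples are then resized to the fixed 128x256 input shape.
    return img.resize((256, 128))  # PIL takes (width, height)
```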
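
Finally, one plausible reading of the voting mechanism, sketched under the assumption that each character predicted by the CTC text-line recognizer votes for the script of its codepoint, and that a fixed confidence threshold decides when to fall back to the vote. The `char_to_script` mapping, the `threshold`, and the tie-breaking are all assumptions.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional

def vote_script(chars: Iterable[str], char_to_script: Mapping[str, str]) -> Optional[str]:
    """Majority vote over the scripts of the characters output by the recognizer."""
    votes = Counter(char_to_script[c] for c in chars if c in char_to_script)
    return votes.most_common(1)[0][0] if votes else None

def adjust_prediction(label: str, score: float, recognized_chars: str,
                      char_to_script: Mapping[str, str], threshold: float = 0.5) -> str:
    """Keep confident classifier outputs; otherwise defer to the character vote."""
    if score >= threshold:
        return label
    voted = vote_script(recognized_chars, char_to_script)
    return voted if voted is not None else label
```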

Confusion Matrix (Detection)

Columns (predicted script): Arabic, Latin, Chinese, Japanese, Korean, Bangla, Hindi, Symbols, None
Rows (ground-truth script):

GT Arabic    4654364112440124330
GT Latin     19359098177325394100782720
GT Chinese   1941933887291552010100
GT Japanese  942013135140774992961330
GT Korean    691661357324104257456260
GT Bangla    8189132136215012350
GT Hindi     58103945407830
GT Symbols   3273012234011131660
GT None      0 0 0 0 0 0 0 0 0