method: BLCT2017-07-02

Authors: Jan Zdenek, Hideki Nakayama

Description: Our method combines a convolutional neural network (CNN) with the conventional bag-of-visual-words approach. A patch-based approach is adopted to handle the variable sizes and aspect ratios of the input images. Individual local patches extracted from the training images are used to train a CNN with six convolutional layers.
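The patch-based step can be sketched as a sliding window that turns a variable-size image into a batch of fixed-size CNN inputs. The patch size and stride below are illustrative assumptions; the description does not state the actual patch dimensions.

```python
import numpy as np

def extract_patches(image, patch_size=32, stride=16):
    """Slide a fixed-size window over an image of arbitrary size and
    aspect ratio, collecting local patches (sizes are assumed here)."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return np.stack(patches)

# A variable-size input still yields fixed-size inputs for the CNN:
img = np.random.rand(64, 96)      # e.g. a 64x96 text region
patches = extract_patches(img)
print(patches.shape)              # (15, 32, 32)
```

Each patch would then be fed to the six-convolutional-layer CNN, whose penultimate-layer activations serve as the patch's feature vector.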
Feature vectors for all patches of each training image are extracted by feeding the patches into the trained CNN and collecting the output of the penultimate layer of the network. Random combinations of three feature vectors are formed, and the three vectors in each combination are summed to produce a local convolutional triplet. The triplets are used to build a bag-of-visual-words vocabulary of 1024 codewords. Each image is then represented by the codewords assigned to the triplets created from its local patches; these codeword assignments are aggregated into a histogram of occurrences, which serves as the global representation for classifying the image. A multi-layer perceptron with two hidden layers and dropout between the layers performs the final classification.
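The triplet and bag-of-visual-words steps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the 256-dimensional feature size is an assumption, and the codebook here is random, standing in for a vocabulary that would in practice be learned (e.g. by k-means) from triplets across the training set.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_triplets(features, n_triplets):
    """Sum random combinations of three per-patch CNN feature vectors
    to form local convolutional triplets."""
    idx = rng.integers(0, len(features), size=(n_triplets, 3))
    return features[idx].sum(axis=1)            # (n_triplets, dim)

def bovw_histogram(triplets, codebook):
    """Assign each triplet to its nearest codeword and count occurrences."""
    # squared distances via ||t - c||^2 = ||t||^2 - 2 t.c + ||c||^2
    d2 = ((triplets ** 2).sum(axis=1)[:, None]
          - 2.0 * triplets @ codebook.T
          + (codebook ** 2).sum(axis=1)[None, :])
    return np.bincount(d2.argmin(axis=1), minlength=len(codebook))

feats = rng.standard_normal((40, 256))          # per-patch features (dim assumed)
codebook = rng.standard_normal((1024, 256))     # 1024 codewords, as in the method
hist = bovw_histogram(make_triplets(feats, 200), codebook)
print(hist.shape, hist.sum())                   # (1024,) 200
```

The resulting 1024-bin histogram is the per-image feature that the two-hidden-layer MLP with dropout would consume.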

Confusion Matrix (Detection)

[Table: ground-truth (GT) classes as rows vs. predicted classes as columns, in the order Arabic, Latin, Chinese, Japanese, Korean, Bangla, Symbols, Mixed, None; the per-cell counts were garbled beyond recovery in extraction. The Mixed and None rows are all zeros.]