method: BLCT2017-07-02

Authors: Jan Zdenek, Hideki Nakayama

Description: A CNN is combined with the bag-of- visual-words approach. A patch-based approach is adopted to solve the issue of variable sizes and aspect ratios of the input images. Individual local patches extracted from training image data are used to train the CNN with 6 convolutional layers. Feature vectors of all patches from each training image are fed to the trained CNN and the output is extracted from the penultimate layer of the network. Random combinations
of feature vectors are created to form local convolutional triplets and the 3 vectors in each triplet are added. The local convolutional triplets are used to create a bag-of-visual-words vocabulary with the size of 1024 codewords. Each image is then represented as a vector of codewords which are then aggregated into histograms of occurrences. The histograms are used for global representation of each image. An MLP with two hidden layers and a “Dropout” after each layer is used for the final classification.

Confusion Matrix

Detection
ArabicLatinChineseJapaneseKoreanBanglaSymbolsMixedNone
GTArabic45715071721154700
Latin14359208366374269859200
Chinese25343471683544200
Japanese10288710214020194131200
Korean163319439367881033800
Bangla12163931162241100
Symbols2413682089227196600
Mixed000000000
None000000000