method: BLCT2017-07-02

Authors: Jan Zdenek, Hideki Nakayama

Description: Our method combines a convolutional neural network (CNN) with the conventional bag-of-visual-words approach. A patch-based approach is adopted to handle the variable sizes and aspect ratios of the input images. Individual local patches extracted from the training images are used to train a CNN with six convolutional layers.
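The patch-based step can be sketched as a sliding window that turns a variable-size image into a batch of fixed-size CNN inputs. The patch size and stride below are illustrative assumptions; the description does not state the actual patch dimensions.

```python
import numpy as np

def extract_patches(image, patch_size=32, stride=16):
    """Slide a fixed-size window over an image of arbitrary size and
    aspect ratio, collecting local patches (sizes are assumed here)."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return np.stack(patches)

# A variable-size input still yields fixed-size inputs for the CNN:
img = np.random.rand(64, 96)      # e.g. a 64x96 text region
patches = extract_patches(img)
print(patches.shape)              # (15, 32, 32)
```

Each patch would then be fed to the six-convolutional-layer CNN, whose penultimate-layer activations serve as the patch's feature vector.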
Feature vectors for all patches of each training image are extracted by feeding the patches into the trained CNN and collecting the output of the penultimate layer of the network. Random combinations of three feature vectors are formed, and the three vectors in each combination are summed to produce a local convolutional triplet. The triplets are used to build a bag-of-visual-words vocabulary of 1024 codewords. Each image is then represented by the codewords assigned to the triplets created from its local patches; these codeword assignments are aggregated into a histogram of occurrences, which serves as the global representation for classifying the image. A multi-layer perceptron with two hidden layers and dropout between the layers performs the final classification.
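The triplet and bag-of-visual-words steps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the 256-dimensional feature size is an assumption, and the codebook here is random, standing in for a vocabulary that would in practice be learned (e.g. by k-means) from triplets across the training set.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_triplets(features, n_triplets):
    """Sum random combinations of three per-patch CNN feature vectors
    to form local convolutional triplets."""
    idx = rng.integers(0, len(features), size=(n_triplets, 3))
    return features[idx].sum(axis=1)            # (n_triplets, dim)

def bovw_histogram(triplets, codebook):
    """Assign each triplet to its nearest codeword and count occurrences."""
    # squared distances via ||t - c||^2 = ||t||^2 - 2 t.c + ||c||^2
    d2 = ((triplets ** 2).sum(axis=1)[:, None]
          - 2.0 * triplets @ codebook.T
          + (codebook ** 2).sum(axis=1)[None, :])
    return np.bincount(d2.argmin(axis=1), minlength=len(codebook))

feats = rng.standard_normal((40, 256))          # per-patch features (dim assumed)
codebook = rng.standard_normal((1024, 256))     # 1024 codewords, as in the method
hist = bovw_histogram(make_triplets(feats, 200), codebook)
print(hist.shape, hist.sum())                   # (1024,) 200
```

The resulting 1024-bin histogram is the per-image feature that the two-hidden-layer MLP with dropout would consume.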

Confusion Matrix (Detection)

[Table: ground-truth (GT) classes as rows vs. predicted classes as columns, in the order Arabic, Latin, Chinese, Japanese, Korean, Bangla, Symbols, Mixed, None; the per-cell counts were garbled beyond recovery in extraction. The Mixed and None rows are all zeros.]