Authors: Jan Zdenek, Hideki Nakayama
Description: Our method combines a convolutional neural network (CNN) with the conventional bag-of-visual-words approach. A patch-based approach is adopted to handle the variable sizes and aspect ratios of the input images: individual local patches extracted from the training images are used to train a CNN with six convolutional layers.
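The patch-based idea above can be sketched as follows: a fixed-size window slides over an image of arbitrary size, so every image, whatever its dimensions, yields a batch of identically shaped CNN inputs. The patch size and stride below are illustrative assumptions, not values taken from the description.

```python
import numpy as np

def extract_patches(image, patch_size=32, stride=16):
    """Extract fixed-size local patches from an image of arbitrary size.

    Because every patch has the same shape, the CNN input size is
    independent of the original image's size and aspect ratio; only
    the number of patches per image varies.
    """
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return np.stack(patches)

# Two images with different sizes and aspect ratios produce patch
# batches of identical patch shape, differing only in patch count.
a = extract_patches(np.zeros((64, 96)))   # shape (15, 32, 32)
b = extract_patches(np.zeros((128, 48)))  # shape (14, 32, 32)
```

Each such patch batch would then be fed to the six-convolutional-layer CNN during training.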
Feature vectors for all patches of each training image are extracted by feeding them into the trained CNN and collecting the output of the penultimate layer. Random combinations of three feature vectors are formed into local convolutional triplets, and the three feature vectors in each triplet are summed element-wise. The local convolutional triplets are used to build a bag-of-visual-words vocabulary of 1024 codewords. Each image is then represented as a vector of codewords corresponding to the local convolutional triplets created from its local patches. The codeword vectors are aggregated into histograms of occurrences, which serve as the global image representations for classification. A multi-layer perceptron with two hidden layers and dropout between layers is used for the final classification.
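The triplet-and-histogram pipeline can be sketched as below. The patch features and the codebook here are random stand-ins: in the actual method the features come from the trained CNN's penultimate layer, and the 1024-codeword vocabulary would be learned (e.g. by k-means clustering) over triplets from all training images. The feature dimension and triplet count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for CNN penultimate-layer features of one image's patches:
# 20 patches, each described by a 64-dimensional feature vector.
features = rng.normal(size=(20, 64))

# Local convolutional triplets: random combinations of three feature
# vectors, summed element-wise into a single vector per triplet.
n_triplets = 100
idx = rng.integers(0, len(features), size=(n_triplets, 3))
triplets = features[idx].sum(axis=1)          # shape (100, 64)

# Random stand-in for the learned bag-of-visual-words vocabulary
# of 1024 codewords.
codebook = rng.normal(size=(1024, 64))

# Quantize each triplet to its nearest codeword (Euclidean distance)...
dists = np.linalg.norm(triplets[:, None, :] - codebook[None, :, :], axis=2)
codewords = dists.argmin(axis=1)

# ...and aggregate the codeword assignments into a normalized histogram
# of occurrences: a fixed-length representation of the whole image,
# suitable as input to the MLP classifier.
hist = np.bincount(codewords, minlength=1024).astype(float)
hist /= hist.sum()
```

The histogram has a fixed length of 1024 regardless of how many patches or triplets the image produced, which is what makes a standard MLP applicable for the final global classification.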