Authors: Jan Zdenek, Hideki Nakayama
Description: A CNN is combined with the bag-of- visual-words approach. A patch-based approach is adopted to solve the issue of variable sizes and aspect ratios of the input images. Individual local patches extracted from training image data are used to train the CNN with 6 convolutional layers. Feature vectors of all patches from each training image are fed to the trained CNN and the output is extracted from the penultimate layer of the network. Random combinations
of feature vectors are created to form local convolutional triplets and the 3 vectors in each triplet are added. The local convolutional triplets are used to create a bag-of-visual-words vocabulary with the size of 1024 codewords. Each image is then represented as a vector of codewords which are then aggregated into histograms of occurrences. The histograms are used for global representation of each image. An MLP with two hidden layers and a “Dropout” after each layer is used for the final classification.