method: Visual Question Answering via deep multimodal learning (2019-04-30)

Authors: Viviana Beltrán (La Rochelle Uni), Mickael Coustaty (La Rochelle Uni), Nicholas Journet (Université Bordeaux), Juan Caicedo (Broad Institute of MIT and Harvard), Antoine Doucet (La Rochelle Uni)

Description: In this work, we propose a deep multimodal learning model built from three components. The first is a visual network, composed of a ResNet backbone followed by one fully connected (FC) layer. The second is a textual network that uses the new BERT word embeddings as initialization, followed by two LSTM units and one FC layer. The third is a fusion component in which the two previous networks are merged, followed by two FC layers and, finally, an output layer over the n-grams representing the answers. Our proposed model explores performance when the words contained in images are represented as n-grams.
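The three-component architecture described above can be sketched as a small PyTorch module. All dimensions (ResNet feature size 2048, BERT embedding size 768, hidden size 512, answer vocabulary size 1000) and the concatenation-based fusion are illustrative assumptions, not the authors' actual hyperparameters:

```python
import torch
import torch.nn as nn

class MultimodalVQA(nn.Module):
    """Sketch of the three-component model: visual branch, textual branch,
    and fusion. Hyperparameters below are assumed for illustration."""

    def __init__(self, visual_dim=2048, text_emb_dim=768,
                 hidden=512, n_answers=1000):
        super().__init__()
        # Visual network: features from a ResNet backbone -> 1 FC layer
        self.visual_fc = nn.Linear(visual_dim, hidden)
        # Textual network: BERT word embeddings -> 2 LSTM units -> 1 FC layer
        self.lstm = nn.LSTM(text_emb_dim, hidden, num_layers=2,
                            batch_first=True)
        self.text_fc = nn.Linear(hidden, hidden)
        # Fusion: merge the two branches, then 2 FC layers and an output
        # layer scoring the n-gram answer vocabulary
        self.fusion = nn.Sequential(
            nn.Linear(hidden * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_answers),
        )

    def forward(self, visual_feats, text_embs):
        v = torch.relu(self.visual_fc(visual_feats))   # (B, hidden)
        _, (h, _) = self.lstm(text_embs)               # h: (2, B, hidden)
        t = torch.relu(self.text_fc(h[-1]))            # top LSTM layer's state
        fused = torch.cat([v, t], dim=1)               # concatenation fusion (assumed)
        return self.fusion(fused)

model = MultimodalVQA()
v = torch.randn(4, 2048)     # e.g. pooled ResNet features for 4 images
q = torch.randn(4, 12, 768)  # e.g. BERT embeddings for 12-token questions
logits = model(v, q)
print(tuple(logits.shape))   # (4, 1000): one score per candidate n-gram answer
```

The exact fusion operator (concatenation, element-wise product, etc.) is not specified in the description; concatenation is used here only as one common choice.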