- Task 1 - Strongly Contextualised - Method: Visual Question Answering via deep multimodal learning
- Method info
- Samples list
- Per sample details
method: Visual Question Answering via deep multimodal learning (2019-04-30)
Authors: Viviana Beltrán (La Rochelle Uni), Mickael Coustaty (La Rochelle Uni), Nicholas Journet (Université Bordeaux), Juan Caicedo (Broad Institute of MIT and Harvard), Antoine Doucet (La Rochelle Uni)
Description: In this work, we propose a deep multimodal learning model based on three components. The first is a visual network, composed of a ResNet followed by one fully connected (FC) layer. The second is a textual network initialized with the new BERT word/sentence text embeddings, followed by two LSTM units and one FC layer. The third is a fusion component in which the two previous networks are merged, followed by two FC layers and, finally, an output layer over n-grams representing the answers. Our proposed model explores performance when the words contained in images are represented by n-grams.