method: USTB-TVQA2019-04-29

Authors: USTB-PRIR (Zan-Xia Jin, Heran Wu, Lu Zhang, Bei Yin, Jingyan Qin, Xu-Cheng Yin)

Description: For the ST-VQA task, standard VQA models perform poorly because they cannot read the text in images. To give the VQA model reading ability, we add OCR (Optical Character Recognition) information to the model. This is a two-stage method. In the first stage, we employ an image feature extraction model to obtain image features and an OCR model to capture the text in the image; all of this information serves as input to the second stage. The second-stage model consists of three components: the first takes the question representation and the image feature representation as input and produces a first joint representation; the second takes the question representation and the OCR representation as input and produces a second joint representation; the third combines the two joint representations to reason over and infer an answer.
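
A minimal sketch of the second-stage fusion described above, assuming PyTorch; the feature dimensions, the element-wise product fusion, and the answer classifier head are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class TwoBranchFusionVQA(nn.Module):
    """Second-stage sketch: a question+image branch and a question+OCR branch,
    whose joint representations are combined to predict an answer."""

    def __init__(self, q_dim=1024, img_dim=2048, ocr_dim=300,
                 joint_dim=1024, num_answers=3000):
        super().__init__()
        # Project each modality into a common joint space.
        self.q_proj_img = nn.Linear(q_dim, joint_dim)
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.q_proj_ocr = nn.Linear(q_dim, joint_dim)
        self.ocr_proj = nn.Linear(ocr_dim, joint_dim)
        # Classifier over the combined joint representation (assumed design).
        self.classifier = nn.Sequential(
            nn.Linear(2 * joint_dim, joint_dim),
            nn.ReLU(),
            nn.Linear(joint_dim, num_answers),
        )

    def forward(self, q_feat, img_feat, ocr_feat):
        # q_feat: (B, q_dim) question representation
        # img_feat: (B, img_dim) pooled image features from the first stage
        # ocr_feat: (B, ocr_dim) pooled embedding of OCR tokens from the first stage
        joint_img = torch.relu(self.q_proj_img(q_feat)) * torch.relu(self.img_proj(img_feat))
        joint_ocr = torch.relu(self.q_proj_ocr(q_feat)) * torch.relu(self.ocr_proj(ocr_feat))
        combined = torch.cat([joint_img, joint_ocr], dim=-1)
        return self.classifier(combined)  # answer logits
```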