Authors: USTB-PRIR (Zan-Xia Jin, Heran Wu, Lu Zhang, Bei Yin, Jingyan Qin, Xu-Cheng Yin)
Description: This is an NLP-QA based method for ST-VQA. Conventional VQA models include only shallow NLP processing and therefore cannot fully understand the semantic information of the question. In our model, we treat ST-VQA entirely as a QA task in NLP. First, we employ pre-trained OCR (Optical Character Recognition) and OD (Object Detection) models to extract text information from the ST-VQA datasets. Second, the OCR and OD results serve as the inputs to our method: each passes through its own sub-network of RNN layers and attention layers, with the two sub-networks sharing parameters. We then apply attention from the OD representation to the OCR representation. Finally, we predict the answer from a high-level question representation together with the final OCR representations.
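The pipeline above (shared RNN+attention sub-networks for OCR and OD, cross-attention from the OD representation to the OCR tokens, and a classifier over the question and final OCR representations) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all module names, layer choices (GRU, dot-product attention), and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    """Sub-network of RNN and attention layers; one instance is reused
    for both the OCR and OD sequences, so their parameters are shared."""
    def __init__(self, emb_dim, hid_dim):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid_dim, 1)  # self-attention scores

    def forward(self, x):
        h, _ = self.rnn(x)                     # (B, T, 2H) per-token states
        w = F.softmax(self.attn(h), dim=1)     # (B, T, 1) attention weights
        return h, (w * h).sum(dim=1)           # token-level and pooled reps

class STVQAModel(nn.Module):
    """Hypothetical end-to-end sketch of the described architecture."""
    def __init__(self, emb_dim=64, hid_dim=64, num_answers=100):
        super().__init__()
        self.encoder = SharedEncoder(emb_dim, hid_dim)  # shared: OCR and OD
        self.q_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.classifier = nn.Linear(2 * hid_dim + hid_dim, num_answers)

    def forward(self, ocr_emb, od_emb, q_emb):
        ocr_h, _ = self.encoder(ocr_emb)       # OCR token representations
        _, od_vec = self.encoder(od_emb)       # pooled OD representation
        # Attention from the OD representation to the OCR tokens
        scores = torch.bmm(ocr_h, od_vec.unsqueeze(2))   # (B, T_ocr, 1)
        alpha = F.softmax(scores, dim=1)
        ocr_final = (alpha * ocr_h).sum(dim=1)           # final OCR rep
        # High-level question representation from the last RNN state
        _, q_h = self.q_rnn(q_emb)
        q_vec = q_h[-1]
        # Predict the answer from question + final OCR representations
        return self.classifier(torch.cat([ocr_final, q_vec], dim=1))
```

For example, feeding batched OCR, OD, and question embeddings of shapes `(B, T, emb_dim)` yields a `(B, num_answers)` score matrix over the answer vocabulary.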