method: VTA (2019-04-30)

Authors: Fengren Wang, iFLYTEK, frwang@iflytek.com; Jinshui Hu, iFLYTEK, jshu@iflytek.com; Jun Du, USTC, jundu@ustc.edu.cn; Lirong Dai, USTC, lrdai@ustc.edu.cn; Jiajia Wu, iFLYTEK, jjwu@iflytek.com

Description: An encoder-decoder (ED) model for ST-VQA.
1. We use OCR and object detection models to extract text and objects from the images.
2. We then use BERT to encode the extracted text and the QA pairs.
3. Finally, we use a model similar to Bottom-Up and Top-Down attention [1] to process the image and question inputs and produce the answer (a schematic sketch of this fusion step follows below).
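As a rough illustration only (not the VTA authors' code), the sketch below shows the core of a Bottom-Up and Top-Down style fusion step: question-guided attention over pre-extracted region features (OCR/object regions), followed by element-wise fusion with the question embedding. All module names, dimensions, and the use of a pooled BERT vector are assumptions.

```python
# Minimal PyTorch sketch of a Bottom-Up / Top-Down style fusion step.
# Region features (e.g. from OCR / object detection) are attended using a
# question embedding (e.g. a pooled BERT output). Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttentionFusion(nn.Module):
    def __init__(self, region_dim=2048, question_dim=768, hidden_dim=512):
        super().__init__()
        self.att = nn.Sequential(
            nn.Linear(region_dim + question_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        self.proj_v = nn.Linear(region_dim, hidden_dim)
        self.proj_q = nn.Linear(question_dim, hidden_dim)

    def forward(self, regions, question):
        # regions: (batch, num_regions, region_dim) -- OCR/object features
        # question: (batch, question_dim)           -- pooled question encoding
        q = question.unsqueeze(1).expand(-1, regions.size(1), -1)
        scores = self.att(torch.cat([regions, q], dim=-1)).squeeze(-1)
        weights = F.softmax(scores, dim=-1)                   # attention over regions
        attended = (weights.unsqueeze(-1) * regions).sum(1)   # weighted sum of regions
        return self.proj_v(attended) * self.proj_q(question)  # element-wise fusion

fusion = TopDownAttentionFusion()
joint = fusion(torch.randn(2, 36, 2048), torch.randn(2, 768))
print(joint.shape)  # torch.Size([2, 512])
```

The joint representation would then feed an answer classifier or decoder; that part is omitted here.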

method: USTB-TQA (2019-04-29)

Authors: USTB-PRIR (Zan-Xia Jin, Heran Wu, Lu Zhang, Bei Yin, Jingyan Qin, Xu-Cheng Yin)

Description: This is an NLP-QA based method for ST-VQA. Typical VQA models include only shallow NLP processing and therefore cannot fully understand the semantic information in the question. In our model, we treat ST-VQA entirely as a QA task in NLP. First, we employ pre-trained OCR (Optical Character Recognition) and OD (Object Detection) models to obtain textual information from the ST-VQA datasets. Second, the OCR and OD results are used as the input of our method: each goes through its own sub-network of RNN layers and attention layers, with the two sub-networks sharing the same parameters. We then apply attention from the OD representation to the OCR representation. Finally, we predict the answer from a high-level question representation and the final OCR representations (an illustrative sketch of the cross-attention step follows below).
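The cross-attention step ("attention from the OD representation to the OCR representation") could be sketched roughly as below, under one possible reading: OCR token features are enriched by attending over OD features, and each OCR token is then scored as an answer candidate. The shared GRU encoder, dimensions, and scoring head are assumptions, not the USTB-PRIR authors' released code, and the final question representation used in their answer prediction is omitted.

```python
# Hypothetical sketch of the OCR/OD cross-attention described above.
# Both streams pass through one shared GRU encoder ("share the same
# parameters"), then cross-attention enriches the OCR representation.
import torch
import torch.nn as nn

class OCRODCrossAttention(nn.Module):
    def __init__(self, embed_dim=300, hidden_dim=256):
        super().__init__()
        # a single encoder shared by the OCR and OD sub-networks
        self.shared_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                                 bidirectional=True)
        self.cross_att = nn.MultiheadAttention(2 * hidden_dim, num_heads=4,
                                               batch_first=True)
        self.scorer = nn.Linear(2 * hidden_dim, 1)

    def forward(self, ocr_emb, od_emb):
        # ocr_emb: (batch, num_ocr_tokens, embed_dim)
        # od_emb:  (batch, num_objects,    embed_dim)
        ocr_h, _ = self.shared_rnn(ocr_emb)
        od_h, _ = self.shared_rnn(od_emb)
        # cross-attention: OCR tokens attend over OD features
        # (one reading of "attention from OD to OCR")
        ocr_ctx, _ = self.cross_att(query=ocr_h, key=od_h, value=od_h)
        # score each OCR token as an answer candidate
        return self.scorer(ocr_ctx).squeeze(-1)   # (batch, num_ocr_tokens)

model = OCRODCrossAttention()
logits = model(torch.randn(2, 20, 300), torch.randn(2, 36, 300))
print(logits.shape)  # torch.Size([2, 20])
```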

method: Focus: A bottom-up approach for Scene Text VQA (2019-04-29)

Authors: Shailza Jolly* (TU Kaiserslautern & DFKI, Kaiserslautern), Shubham Kapoor* (Fraunhofer IAIS, Germany), Andreas Dengel (TU Kaiserslautern & DFKI, Kaiserslautern) [*equal contribution]

Description: We propose a novel scene text Visual Question Answering architecture called Focus. The proposed architecture uses a bottom-up attention mechanism, via Faster R-CNN (with ResNet-101), to extract the visual features of multiple regions of interest (ROIs). The top-down attention over these ROIs is computed using the question embedding from a GRU encoder network. The attended visual features are fused with the question embedding to generate a joint representation of image and question. Finally, a GRU-based decoder generates open-ended answer sequences conditioned on the joint representation.
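As a rough, end-to-end illustration of the pipeline described above (not the authors' implementation), the sketch below wires together a GRU question encoder, top-down attention over pre-extracted Faster R-CNN ROI features, a fusion layer, and a greedy GRU decoder. Vocabulary size, dimensions, start token, and decoding strategy are all assumptions.

```python
# Rough sketch of a Focus-style pipeline: GRU question encoder, top-down
# attention over ROI features, image-question fusion, and a GRU decoder
# that emits an open-ended answer sequence. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocusSketch(nn.Module):
    def __init__(self, vocab_size=10000, roi_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.q_encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.att = nn.Linear(roi_dim + hidden_dim, 1)
        self.fuse = nn.Linear(roi_dim + hidden_dim, hidden_dim)
        self.decoder = nn.GRUCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, question_ids, roi_feats, max_len=8):
        # question_ids: (batch, q_len), roi_feats: (batch, num_rois, roi_dim)
        _, q = self.q_encoder(self.embed(question_ids))
        q = q.squeeze(0)                                       # (batch, hidden)
        # top-down attention: the question embedding scores each ROI
        q_exp = q.unsqueeze(1).expand(-1, roi_feats.size(1), -1)
        w = F.softmax(self.att(torch.cat([roi_feats, q_exp], -1)).squeeze(-1), -1)
        v = (w.unsqueeze(-1) * roi_feats).sum(1)               # attended visual feature
        joint = torch.tanh(self.fuse(torch.cat([v, q], -1)))   # image-question fusion
        # GRU decoder conditioned on the joint representation (greedy decoding)
        h, tok, outputs = joint, torch.zeros_like(question_ids[:, 0]), []
        for _ in range(max_len):
            h = self.decoder(self.embed(tok), h)
            tok = self.out(h).argmax(-1)
            outputs.append(tok)
        return torch.stack(outputs, dim=1)                     # (batch, max_len)

model = FocusSketch()
answer = model(torch.randint(0, 10000, (2, 12)), torch.randn(2, 36, 2048))
print(answer.shape)  # torch.Size([2, 8])
```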

Ranking Table

Date         Method                                                    Score
2019-04-30   VTA                                                       0.5063
2019-04-29   USTB-TQA                                                  0.4553
2019-04-29   Focus: A bottom-up approach for Scene Text VQA            0.2959
2019-04-30   Visual Question Answering via deep multimodal learning    0.1411
2019-04-29   USTB-TVQA                                                 0.1243
2019-04-29   TRAN MINH TRIEU                                           0.0545

Ranking Graphic