method: VTA2019-04-30

Authors: Fengren Wang, iFLYTEK, frwang@iflytek.com; Jinshui Hu, iFLYTEK, jshu@iflytek.com; Jun Du, USTC, jundu@ustc.edu.cn; Lirong Dai, USTC, lrdai@ustc.edu.cn; Jiajia Wu, iFLYTEK, jjwu@iflytek.com

Description: An ED model for ST-VQA
1. We use OCR and object detection models to extract text and objects from images.
2. Then We use Bert to encode the extracted text and QA pairs.
3. Finally We use a similar model of Bottom-Up and Top-Down[1] to handle the image and question input and give the answer output.