Authors: Fengren Wang, iFLYTEK, firstname.lastname@example.org; Jinshui Hu, iFLYTEK, email@example.com; Jun Du, USTC, firstname.lastname@example.org; Lirong Dai, USTC, email@example.com; Jiajia Wu, iFLYTEK, firstname.lastname@example.org
Description: An ED model for ST-VQA
1. We use OCR and object detection models to extract text and objects from images.
2. Then we use BERT to encode the extracted text and QA pairs.
3. Finally, we use a model similar to Bottom-Up and Top-Down to process the image and question inputs and output the answer.
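
The three steps above can be illustrated with a minimal sketch. It is not the submitted model: it assumes the OCR, object detection, and BERT stages have already produced fixed-size feature vectors, and it replaces the learned attention and answer layers with plain dot products. All names (`attend`, `answer`, the toy vectors) are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question: Vector, regions: List[Vector]) -> Tuple[List[float], Vector]:
    """Top-down step: weight each region by its similarity to the question.

    Assumption: dot-product scoring stands in for a learned attention layer.
    """
    weights = softmax([dot(question, r) for r in regions])
    dim = len(regions[0])
    context = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return weights, context


def answer(question: Vector, object_feats: List[Vector],
           ocr_tokens: List[str], ocr_feats: List[Vector]) -> str:
    """Pick the OCR token that best matches the question-conditioned image context."""
    # Bottom-up step: detected objects and OCR tokens together form the region set.
    regions = object_feats + ocr_feats
    _, context = attend(question, regions)
    # Assumption: element-wise product as a simple question/image fusion.
    fused = [q * c for q, c in zip(question, context)]
    scores = [dot(fused, f) for f in ocr_feats]
    best = max(range(len(scores)), key=scores.__getitem__)
    return ocr_tokens[best]


# Toy 2-d features standing in for detector and BERT outputs (made up for illustration).
question_vec = [1.0, 0.0]
object_vecs = [[0.0, 1.0]]
tokens = ["STOP", "EXIT"]
token_vecs = [[2.0, 0.0], [0.0, 2.0]]
print(answer(question_vec, object_vecs, tokens, token_vecs))
```

Restricting the candidate answers to the OCR tokens is a simplification made for this sketch. The description above does not state how the answer output is produced.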