method: ssbaseline (2020-09-09)

Authors: Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu

Affiliation: Northwestern Polytechnical University

Email: zephyrzhuqi@gmail.com

Description: We hope this work will set a new baseline for these two OCR-text related applications and inspire new thinking in multi-modality encoder design.

method: SMA (2020-03-06)

Authors: Anonymous

Affiliation: Anonymous

Description: Structured Multimodal Attentions

method: VTA (2019-04-30)

Authors: Fengren Wang (iFLYTEK, frwang@iflytek.com); Jinshui Hu (iFLYTEK, jshu@iflytek.com); Jun Du (USTC, jundu@ustc.edu.cn); Lirong Dai (USTC, lrdai@ustc.edu.cn); Jiajia Wu (iFLYTEK, jjwu@iflytek.com)

Description: An encoder-decoder (ED) model for ST-VQA (a code sketch follows the list):
1. We use OCR and object detection models to extract text and objects from images.
2. We then use BERT to encode the extracted text and the QA pairs.
3. Finally, we use a model similar to Bottom-Up and Top-Down [1] to process the image and question inputs and produce the answer output.
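
A minimal PyTorch sketch of the three-step pipeline above, assuming precomputed region features (standing in for the OCR and object detection outputs of step 1) and a BERT question embedding (step 2). The dimensions, the additive top-down attention, and the element-wise fusion are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: feature sizes, module names, and the fusion scheme are assumptions.
import torch
import torch.nn as nn

class TopDownAttentionVQA(nn.Module):
    """Bottom-Up and Top-Down style answerer [1]: question-guided attention
    over region features (detected objects + OCR tokens), followed by a
    joint embedding scored against a fixed answer vocabulary."""

    def __init__(self, region_dim=2048, text_dim=768, hidden=1024, n_answers=3000):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden)
        self.query_proj = nn.Linear(text_dim, hidden)
        self.att = nn.Linear(hidden, 1)  # top-down attention scores
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_answers)
        )

    def forward(self, regions, question):
        # regions:  (B, K, region_dim) -- object + OCR region features (step 1)
        # question: (B, text_dim)      -- e.g. BERT [CLS] embedding (step 2)
        r = self.region_proj(regions)               # (B, K, H)
        q = self.query_proj(question).unsqueeze(1)  # (B, 1, H)
        scores = self.att(torch.tanh(r + q))        # (B, K, 1), additive attention
        weights = torch.softmax(scores, dim=1)      # attend over the K regions
        attended = (weights * r).sum(dim=1)         # (B, H)
        joint = attended * q.squeeze(1)             # element-wise fusion (step 3)
        return self.classifier(joint)               # answer logits

# Smoke test with random tensors standing in for detector / BERT outputs.
model = TopDownAttentionVQA()
logits = model(torch.randn(2, 36, 2048), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 3000])
```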

Ranking Table

Date         Method                                                   Score
2020-09-09   ssbaseline                                               0.5490
2020-03-06   SMA                                                      0.5081
2019-04-30   VTA                                                      0.5063
2021-08-15   ss1.0                                                    0.5045
2020-05-22   RUArt                                                    0.4817
2019-04-29   USTB-TQA                                                 0.4553
2019-04-29   Focus: A bottom-up approach for Scene Text VQA           0.2959
2019-04-30   Visual Question Answering via deep multimodal learning   0.1411
2019-04-29   USTB-TVQA                                                0.1243
2019-04-29   TRAN MINH TRIEU                                          0.0545

Ranking Graphic: [figure not included in this text version]