method: ssbaseline (2020-09-09)

Authors: Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu

Affiliation: Northwestern Polytechnical University

Email: zephyrzhuqi@gmail.com

Description: We hope this work sets a new baseline for these two OCR-text-related tasks and inspires new thinking in multi-modality encoder design.

method: VTA (2019-04-30)

Authors: Fengren Wang, iFLYTEK, frwang@iflytek.com; Jinshui Hu, iFLYTEK, jshu@iflytek.com; Jun Du, USTC, jundu@ustc.edu.cn; Lirong Dai, USTC, lrdai@ustc.edu.cn; Jiajia Wu, iFLYTEK, jjwu@iflytek.com

Description: An encoder-decoder (ED) model for ST-VQA (a minimal sketch follows the list below):
1. We use OCR and object detection models to extract text and objects from images.
2. We then use BERT to encode the extracted text and the QA pairs.
3. Finally, we use a model similar to Bottom-Up and Top-Down [1] to process the image and question inputs and produce the answer output.
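Below is a minimal, illustrative PyTorch sketch of the fusion step this pipeline describes: a question/OCR encoding (e.g. a BERT [CLS] vector) attends over pre-extracted region features in the Bottom-Up/Top-Down style. All layer sizes, the 36-region count, and the answer-classification head (standing in for the authors' decoder) are assumptions for illustration, not the submission's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    # Attends over region features with a question vector, then scores answers.
    def __init__(self, v_dim=2048, q_dim=768, hid=512, n_ans=5000):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid)
        self.q_proj = nn.Linear(q_dim, hid)
        self.att = nn.Linear(hid, 1)             # one attention logit per region
        self.fuse_v = nn.Linear(v_dim, hid)
        self.fuse_q = nn.Linear(q_dim, hid)
        self.classifier = nn.Linear(hid, n_ans)  # fixed answer vocabulary (assumption)

    def forward(self, v, q):
        # v: (B, R, v_dim) region features; q: (B, q_dim) question/OCR encoding
        joint = torch.tanh(self.v_proj(v) + self.q_proj(q).unsqueeze(1))
        alpha = F.softmax(self.att(joint), dim=1)   # attention weights over regions
        v_att = (alpha * v).sum(dim=1)              # attended visual feature
        h = torch.tanh(self.fuse_v(v_att)) * torch.tanh(self.fuse_q(q))  # gated fusion
        return self.classifier(h)                   # answer scores

model = TopDownAttention()
v = torch.randn(2, 36, 2048)  # e.g. 36 detected regions per image
q = torch.randn(2, 768)       # e.g. BERT [CLS] embedding of question + OCR tokens
print(model(v, q).shape)      # torch.Size([2, 5000])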

method: Focus: A bottom-up approach for Scene Text VQA (2019-04-29)

Authors: Shailza Jolly* (TU Kaiserslautern & DFKI, Kaiserslautern), Shubham Kapoor* (Fraunhofer IAIS, Germany), Andreas Dengel (TU Kaiserslautern & DFKI, Kaiserslautern) [*equal contribution]

Description: We propose a novel scene-text Visual Question Answering architecture called Focus. The architecture uses a bottom-up attention mechanism, via Faster R-CNN (with ResNet-101), to extract visual features for multiple regions of interest (ROIs). Top-down attention over these ROIs is computed from the question embedding produced by a GRU encoder. The attended visual features are fused with the question embedding to form a joint image-question representation. Finally, a GRU-based decoder generates open-ended answer sequences conditioned on the joint representation.
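The following PyTorch sketch mirrors the described Focus pipeline end to end under stated assumptions: a GRU question encoder, top-down attention over pre-extracted ROI features, fusion into a joint representation, and a GRU decoder emitting an answer sequence with teacher forcing. The module name FocusSketch and all dimensions and vocabulary sizes are hypothetical, not the authors' settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FocusSketch(nn.Module):
    def __init__(self, vocab=10000, emb=300, hid=512, v_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.q_enc = nn.GRU(emb, hid, batch_first=True)  # question encoder
        self.att = nn.Linear(hid + v_dim, 1)             # top-down attention logits
        self.fuse = nn.Linear(hid + v_dim, hid)          # joint representation
        self.dec = nn.GRUCell(emb, hid)                  # answer-sequence decoder
        self.out = nn.Linear(hid, vocab)

    def forward(self, question, rois, answer_in):
        # question: (B, Tq) token ids; rois: (B, R, v_dim); answer_in: (B, Ta)
        _, q = self.q_enc(self.embed(question))          # final hidden state (1, B, hid)
        q = q.squeeze(0)
        q_exp = q.unsqueeze(1).expand(-1, rois.size(1), -1)
        alpha = F.softmax(self.att(torch.cat([q_exp, rois], -1)), dim=1)
        v = (alpha * rois).sum(1)                        # attended ROI feature
        h = torch.tanh(self.fuse(torch.cat([q, v], -1))) # joint image-question state
        logits = []
        for t in range(answer_in.size(1)):               # teacher-forced decoding
            h = self.dec(self.embed(answer_in[:, t]), h)
            logits.append(self.out(h))
        return torch.stack(logits, 1)                    # (B, Ta, vocab)

m = FocusSketch()
out = m(torch.randint(0, 10000, (2, 8)),   # question token ids
        torch.randn(2, 36, 2048),          # e.g. 36 Faster R-CNN ROI features
        torch.randint(0, 10000, (2, 4)))   # shifted answer tokens
print(out.shape)  # torch.Size([2, 4, 10000])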

Ranking Table

Date        Method                                           Score
2020-09-09  ssbaseline                                       0.5490
2019-04-30  VTA                                              0.5063
2019-04-29  Focus: A bottom-up approach for Scene Text VQA   0.2959

Ranking Graphic (plot not reproduced here)