Results - ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering

method: ssbaseline2020-09-09

Authors: Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu

Affiliation: Northwestern Polytechnical University

Description: We wish this work to set the new baseline for these two OCR text related applications and to inspire new thinking of multi-modality encoder design.

https://arxiv.org/abs/2012.05153

Source code

method: Focus: A bottom-up approach for Scene Text VQA2019-04-29

Authors: Shailza Jolly* (TU Kaiserslautern & DFKI, Kaiserslautern), Shubham Kapoor* (Fraunhofer IAIS, Germany), Andreas Dengel (TU Kaiserslautern & DFKI, Kaiserslautern) [*equal contribution]

Description: We propose a novel scene text Visual Question Answering architecture called Focus. The proposed architecture uses a bottom-up attention mechanism, via Faster R-CNN (with Resnet 101), to extract the visual features of multiple regions of interests (ROI). The top-down attention on these multiple ROIs is calculated using the question embedding from a GRU encoder network. The attended visual features are fused with the question embedding to generate a joint representation of image and question. At last, a GRU based decoder generates open-ended answer sequences conditioned on the joint representation.

Anderson, Peter, et al. "Bottom-up and top-down attention for image captioning and visual question answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.

Source code

Ranking Table

Description Paper Source Code

Date				Method	Score
2020-09-09				ssbaseline	0.5490
2019-04-29				Focus: A bottom-up approach for Scene Text VQA	0.2959

Inactive evaluations

method: ssbaseline2020-09-09

method: Focus: A bottom-up approach for Scene Text VQA2019-04-29

Ranking Table

Ranking Graphic