method: ssbaseline
Date: 2020-09-09
Authors: Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu
Affiliation: Northwestern Polytechnical University
Email: zephyrzhuqi@gmail.com
Description: We hope this work will set a new baseline for these two OCR-text-related applications and inspire new thinking about multi-modality encoder design.
method: Focus: A bottom-up approach for Scene Text VQA
Date: 2019-04-29
Authors: Shailza Jolly* (TU Kaiserslautern & DFKI, Kaiserslautern), Shubham Kapoor* (Fraunhofer IAIS, Germany), Andreas Dengel (TU Kaiserslautern & DFKI, Kaiserslautern) [*equal contribution]
Description: We propose a novel scene text Visual Question Answering architecture called Focus. The architecture uses a bottom-up attention mechanism, via Faster R-CNN (with ResNet-101), to extract visual features for multiple regions of interest (ROIs). Top-down attention over these ROIs is computed using the question embedding from a GRU encoder network. The attended visual features are fused with the question embedding to produce a joint representation of image and question. Finally, a GRU-based decoder generates open-ended answer sequences conditioned on the joint representation.
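The top-down attention and fusion steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring, the element-wise product used for fusion, and the names `top_down_attention` and `fuse` are all assumptions made for illustration, and the ROI features and question embedding are plain Python lists standing in for the Faster R-CNN and GRU outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_down_attention(roi_features, question_embedding):
    """Return the attention-weighted ROI feature and the attention weights.

    Assumption: relevance is a plain dot product between each ROI feature
    and the question embedding; the description does not specify the scoring.
    """
    scores = [
        sum(f * q for f, q in zip(roi, question_embedding))
        for roi in roi_features
    ]
    weights = softmax(scores)
    dim = len(roi_features[0])
    attended = [
        sum(w * roi[d] for w, roi in zip(weights, roi_features))
        for d in range(dim)
    ]
    return attended, weights


def fuse(attended, question_embedding):
    """Joint representation of image and question.

    Assumption: element-wise product; the description only says "fused".
    """
    return [a * q for a, q in zip(attended, question_embedding)]


# Hypothetical toy inputs: three ROIs with 2-dimensional features.
rois = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [2.0, 0.0]
attended, weights = top_down_attention(rois, question)
joint = fuse(attended, question)
```

In this toy example the first ROI receives the largest weight because it aligns most closely with the question vector. In the described architecture, a GRU decoder would then be conditioned on `joint` to generate the answer sequence.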
Date | Method | Score
---|---|---
2020-09-09 | ssbaseline | 0.5513
2019-04-29 | Focus: A bottom-up approach for Scene Text VQA | 0.0800