- Task 1 - Strongly Contextualised - Method: Focus: A bottom-up approach for Scene Text VQA
Method: Focus: A bottom-up approach for Scene Text VQA
Date: 2019-04-29
Authors: Shailza Jolly* (TU Kaiserslautern & DFKI, Kaiserslautern), Shubham Kapoor* (Fraunhofer IAIS, Germany), Andreas Dengel (TU Kaiserslautern & DFKI, Kaiserslautern) [*equal contribution]
Description: We propose Focus, a novel architecture for scene text Visual Question Answering. The architecture uses a bottom-up attention mechanism, via Faster R-CNN (with ResNet-101), to extract visual features for multiple regions of interest (ROIs). Top-down attention over these ROIs is computed using the question embedding from a GRU encoder network. The attended visual features are fused with the question embedding to produce a joint representation of the image and the question. Finally, a GRU-based decoder generates open-ended answer sequences conditioned on this joint representation.
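The attention and fusion steps described above can be illustrated with a minimal sketch. This is not the authors' implementation. The description does not say how the attention scores are computed or how the two modalities are fused, so the dot-product scoring and the element-wise product fusion below are assumptions, and the function names (`attend`, `fuse`) and the toy vectors are hypothetical. In the actual system, the ROI features would come from Faster R-CNN and the question embedding from the GRU encoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(roi_features, question_embedding):
    """Top-down attention: weight each ROI by its relevance to the question.

    ASSUMPTION: relevance is a plain dot product between the ROI feature and
    the question embedding; the description does not specify the scoring function.
    """
    scores = [
        sum(v * q for v, q in zip(roi, question_embedding))
        for roi in roi_features
    ]
    weights = softmax(scores)
    dim = len(question_embedding)
    # Attended visual feature: attention-weighted sum of the ROI features.
    attended = [
        sum(w * roi[d] for w, roi in zip(weights, roi_features))
        for d in range(dim)
    ]
    return weights, attended


def fuse(attended, question_embedding):
    """Joint representation of image and question.

    ASSUMPTION: element-wise product; the description only says "fused".
    """
    return [a * q for a, q in zip(attended, question_embedding)]


# Toy inputs (hypothetical values): three ROIs with 2-dimensional features.
roi_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question_embedding = [2.0, 0.0]

weights, attended = attend(roi_features, question_embedding)
joint = fuse(attended, question_embedding)
print(weights, joint)
```

In this toy case the first and third ROIs align with the question vector and receive equal, larger weights than the second. The resulting `joint` vector is what a decoder would be conditioned on when generating the answer sequence.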