method: Hi-VT5
Date: 2023-03-28
Authors: Rubèn Tito
Affiliation: Computer Vision Center (CVC)
Email: rperez@cvc.uab.cat
Description: Hierarchical Visual T5 (Hi-VT5) is a T5 model extended with spatial and visual features. Each page is passed through the VT5 encoder together with the question, and the most relevant information is embedded into the [PAGE] tokens. The [PAGE] tokens from all pages of the document are then sent to the VT5 decoder, which generates the answer autoregressively. In addition, a page prediction module receives the [PAGE] tokens from the encoder's output and predicts which page contains the information needed to answer the question.
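
The data flow described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the encoder, embeddings, and page scorer are toy stand-ins, and all names (`toy_embed`, `toy_encoder`, `encode_page`, `predict_page`, `hi_vt5_forward`, `NUM_PAGE_TOKENS`, `HIDDEN`, the `[PAGE_i]` token spelling) are hypothetical. It only shows the hierarchical structure: each page is encoded with the question, only the [PAGE] token outputs are kept, and those outputs from all pages feed both the decoder input and the page prediction step. The decoder itself is omitted.

```python
import math
from typing import List, Tuple

Vector = List[float]

NUM_PAGE_TOKENS = 2  # assumption: number of [PAGE] tokens per page (toy value)
HIDDEN = 4           # assumption: toy hidden size


def toy_embed(token: str) -> Vector:
    """Stand-in for learned embeddings: a deterministic toy vector per token."""
    seed = sum(ord(c) for c in token)
    return [math.sin(seed * (i + 1)) for i in range(HIDDEN)]


def toy_encoder(sequence: List[str]) -> List[Vector]:
    """Stand-in for the VT5 encoder: mixes each token with the sequence mean."""
    embedded = [toy_embed(t) for t in sequence]
    mean = [sum(col) / len(embedded) for col in zip(*embedded)]
    return [[0.5 * x + 0.5 * m for x, m in zip(vec, mean)] for vec in embedded]


def encode_page(question: List[str], page_words: List[str]) -> List[Vector]:
    """Encode one page with the question; keep only the [PAGE] token outputs."""
    page_tokens = [f"[PAGE_{i}]" for i in range(NUM_PAGE_TOKENS)]
    sequence = page_tokens + question + page_words
    return toy_encoder(sequence)[:NUM_PAGE_TOKENS]


def predict_page(page_states: List[List[Vector]]) -> int:
    """Stand-in for the page prediction module: score pages, return the argmax."""
    scores = [sum(sum(vec) for vec in states) for states in page_states]
    return max(range(len(scores)), key=scores.__getitem__)


def hi_vt5_forward(
    question: List[str], pages: List[List[str]]
) -> Tuple[List[Vector], int]:
    """Return the concatenated [PAGE] states (decoder input) and a page index."""
    page_states = [encode_page(question, words) for words in pages]
    decoder_input = [vec for states in page_states for vec in states]
    return decoder_input, predict_page(page_states)


if __name__ == "__main__":
    question = ["what", "is", "the", "total"]
    pages = [["invoice", "number", "42"], ["total", "due", "100"], ["thank", "you"]]
    decoder_input, page_idx = hi_vt5_forward(question, pages)
    print(len(decoder_input), "decoder input vectors; predicted page", page_idx)
```

Because only `NUM_PAGE_TOKENS` vectors per page reach the decoder, the decoder input grows with the number of pages rather than with the total number of words in the document, which is the point of the hierarchical design in this sketch.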