Authors: Asaf Gendler, Sharon Fogel, Inbal Lavi, Yair Kittenplon, Alona Golts, Shahar Tsiper
Affiliation: AWS AI Labs
Description: We propose a hierarchical detection approach based on a Deformable DETR architecture with two consecutive decoders: a line decoder and a word decoder. The line decoder detects the lines in the image, while the word decoder learns to extract all the words from a given line. Our algorithm can correctly identify the number of words within each line and predict them in the correct order.
In addition, by predicting an arbitrary number of word detections from a single line query, we are able to detect thousands of text instances in a given image while operating on only a few hundred queries, reducing computational complexity.
Finally, a paragraph head based on an affinity matrix learns to cluster the detected lines into groups, so that paragraph predictions are also produced in a single end-to-end operation.
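To illustrate how a line-to-line affinity matrix could be turned into paragraph groups, here is a minimal sketch. It is not the submitted implementation: the description does not say how clusters are read off the matrix, so the thresholding step, the connected-components grouping, the function name `cluster_lines`, and the default `threshold` of 0.5 are all assumptions made for illustration.

```python
from typing import Dict, List


def cluster_lines(affinity: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Group line indices into paragraphs (hypothetical post-processing).

    Assumption: affinity[i][j] is a score in [0, 1] for lines i and j
    belonging to the same paragraph. Pairs whose symmetrized score exceeds
    `threshold` are linked, and each connected component is one paragraph.
    """
    n = len(affinity)
    parent = list(range(n))  # union-find forest, one node per detected line

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Average both directions in case the predicted matrix is not symmetric.
            score = 0.5 * (affinity[i][j] + affinity[j][i])
            if score > threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Attach the larger root to the smaller one, so each
                    # component's root is its lowest line index.
                    parent[max(ri, rj)] = min(ri, rj)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Paragraphs ordered by their first line; lines within a paragraph keep index order.
    return [groups[root] for root in sorted(groups)]


# Toy example: lines 0-1 and lines 2-3 have high mutual affinity.
toy_affinity = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
paragraphs = cluster_lines(toy_affinity)
```

In this toy case the four lines split into two paragraphs, `[0, 1]` and `[2, 3]`. Connected components is only one possible grouping rule; it links lines transitively, so a single spurious high-affinity pair can merge two paragraphs, which is why the threshold would need tuning on validation data.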