Authors: Shangbang Long, Yushuo Guan, Bingxuan Wang, Kaigui Bian, Cong Yao
Email: email@example.com, firstname.lastname@example.org
Description: Our method comprises three modules: a detection module that localizes the target text with a bounding polygon, a spatial transformer network layer that relocates and rectifies the text, and an attentional RNN module that decodes the text. Localization is performed on image crops whose long sides are resized to a fixed length, with the short sides scaled to preserve the original aspect ratio. The rectification module then rectifies the input images before they are fed into the RNN module.
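As an illustration of the resizing step, the following is a minimal sketch of how the target size of a crop could be computed. The function name `resized_shape` and the default long-side length of 512 pixels are assumptions made for this example; the fixed length actually used is not stated here.

```python
def resized_shape(width, height, long_side=512):
    """Return (new_width, new_height) with the longer side scaled to long_side.

    long_side=512 is a placeholder value chosen for illustration only.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # A single scale factor for both sides preserves the aspect ratio.
    scale = long_side / max(width, height)
    # Round to whole pixels; clamp to 1 so very thin crops do not collapse to 0.
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height
```

The returned size can then be passed to the resize routine of whichever image library is in use.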
The main idea is derived from our review and analysis of existing methods. We determined that an attentional RNN is a key module for recognizing words accurately, while text localization in cropped images is relatively easy and reliable. We therefore combined the two in our model. In our experiments, the localization module proved accurate enough for this purpose.
Note that we focused only on the Latin recognition part.