method: VGGMaxBBNet (055) 2015-04-03

Authors: Ankush Gupta, Max Jaderberg, Andrew Zisserman

Description: This system (named VGGMaxBBNet) uses a region proposal mechanism for detection [1] and deep convolutional neural networks for recognition [2]. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, followed by a fast filtering stage to improve precision.

The region proposals are generated by two detection mechanisms -- (1) the Edge Boxes region proposal algorithm and (2) a weak Aggregate Channel Features (ACF) detector. The high-recall proposals (about 2000 per image) are subsequently filtered with a text/no-text classifier to improve precision, and regressed to improve their overlap with the words. This leaves a few hundred proposals per image.
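The merge-and-filter stage can be sketched roughly as follows. This is a minimal illustration, not the submission's actual code: `score_fn` stands in for the text/no-text classifier, proposals are represented as opaque tuples, and the bounding-box regression step is omitted.

```python
def merge_and_filter(edgebox_props, acf_props, score_fn, keep_thresh=0.5):
    """Combine proposals from both generators (high recall), then keep
    only those the text/no-text classifier scores above a threshold
    (improved precision). score_fn is a placeholder for the classifier."""
    proposals = list(edgebox_props) + list(acf_props)
    return [p for p in proposals if score_fn(p) >= keep_thresh]
```

In the full pipeline each surviving proposal would additionally be regressed to tighten its overlap with the underlying word before recognition.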

Words are then recognized from the text image regions by first scoring them separately with two different deep convolutional neural networks (CNNs) based on -- (1) a 90k-word fixed-lexicon encoding and (2) a character-sequence encoding [2]. The scores from the fixed-lexicon encoding are used to filter overlapping proposal regions through non-maximal suppression.

A final recognition is made through the consensus of the annotations and recognition scores obtained from these two neural networks. The consensus depends on five quantities: the two recognition scores, the edit distance between the two annotations, and the respective minimum edit distances of the two annotations to the per-image lexicon. Using two different recognition CNNs is a refinement of the procedure specified in [1] made specifically for the Robust Reading Challenge. The two recognition CNNs are complementary -- the fixed-lexicon encoding is more robust than the character-sequence encoding, while the latter can capture out-of-lexicon and numeric text.
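One way the consensus could combine these quantities is sketched below. The decision rule here is a hypothetical illustration (the submission does not specify its exact combination): agree when the annotations match, otherwise prefer the annotation closer to the per-image lexicon, breaking ties by recognition score.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def consensus(word_lex, score_lex, word_seq, score_seq, lexicon):
    """Hypothetical consensus rule combining the two CNN outputs;
    the actual rule used by the system is not specified."""
    if word_lex == word_seq:
        return word_lex
    d_lex = min(edit_distance(word_lex, w) for w in lexicon)
    d_seq = min(edit_distance(word_seq, w) for w in lexicon)
    if d_lex != d_seq:
        return word_lex if d_lex < d_seq else word_seq
    return word_lex if score_lex >= score_seq else word_seq
```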


References:

1) M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman
Reading Text in the Wild with Convolutional Neural Networks
arXiv preprint arXiv:1412.1842 (2014)


2) M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman
Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition
Workshop on Deep Learning, NIPS, 2014