method: Foo & Bar 2017-07-01

Authors: Zheqi He, Yongtao Wang

Description: An improvement of Faster R-CNN for detecting quadrilateral objects such as text. The bounding-box regression layer is replaced with a quadrangle regression layer, and the regression target and the loss function are modified accordingly. ResNet-152 is used as the base network. To combine ResNet-152 with Faster R-CNN, conv4_x and conv5_x of ResNet-152 are disconnected, the downsampling of conv4_x is removed, and the region proposal network (RPN) and RoIPooling are inserted between them. The network first processes the whole image to produce a convolutional feature map. This map is the input of the RPN, which generates regions of interest (RoIs), each with an objectness score. The RoIs and the feature map generated by conv4_x are fed to the RoIPooling layer to obtain a fixed-size feature map for each RoI. This feature map is fed to several convolutional layers (conv5_x). Conv5_x and the layers after it play the role of the fully connected layers commonly seen in VGG networks: they compute the feature map of each RoI, which is then pooled by global average pooling (GAP). Finally, the output of GAP is fed into two sibling output layers: a classification layer that predicts the label of each RoI, and a quadrangle regression layer that outputs 8 real-valued numbers for each RoI. Each set of 8 values encodes the coordinates of the vertices of the text region. The method is implemented in TensorFlow. The detection network is pre-trained on ImageNet; no other additional data was used.
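
The description says only that the 8 values encode the vertex coordinates; it does not give the vertex order or say whether the values are absolute coordinates or offsets. As a minimal sketch, assuming absolute coordinates in the order x1, y1, ..., x4, y4, the output for one RoI could be grouped into vertices as follows. The helper names `to_quadrangle` and `enclosing_box` and the sample numbers are hypothetical and are not taken from the submission.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def to_quadrangle(values: Sequence[float]) -> List[Point]:
    """Group 8 regression outputs into 4 (x, y) vertices.

    Assumption: values are absolute coordinates ordered x1, y1, ..., x4, y4.
    """
    if len(values) != 8:
        raise ValueError("expected 8 values per RoI")
    return [(float(values[i]), float(values[i + 1])) for i in range(0, 8, 2)]


def enclosing_box(quad: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned box (xmin, ymin, xmax, ymax) around a quadrangle."""
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return min(xs), min(ys), max(xs), max(ys)


# Made-up example output for one RoI, not data from the submission.
quad = to_quadrangle([10, 20, 110, 25, 105, 60, 8, 55])
print(quad)
print(enclosing_box(quad))
```

The enclosing box is included only to show how a quadrangle relates to the ordinary bounding box that the quadrangle regression layer replaces.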

method: SRC-B-MachineLearningLab 2017-06-30

Authors: Yingying Jiang, Xiaobing Wang, Xiangyu Zhu, Shuli Yang, Wei Li, Zhenbo Luo

Description: Samsung R&D Institute China - Beijing, Machine Learning Lab. The method is based on "R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection" (arXiv:1706.09579).

method: CCFLAB 2017-06-29

Authors: Dai Yuchen

Description: I'm from Shanghai Jiao Tong University. This method uses Deformable Convolutional Nets as the base architecture. A ResNet-101 is used as the backbone convolutional network for feature extraction. During feature extraction, deformable convolution layers are added to capture text patterns with deformable convolutional kernels. A region proposal network, built from 3x3 convolutions, then generates regions of interest (RoIs). A deformable RoI pooling layer crops the RoIs to fixed-size feature maps. These RoI representations are then sent to the final classification and box-regression branches.

Ranking Table

Date | Method | Average Precision | Hmean | Recall | Precision
2017-07-01 | Foo & Bar | 67.16% | 5.95% | 83.66% | 3.08%
2017-06-29 | Tencent-DPPR Team & USTB-PRIR | 61.95% | 40.30% | 74.98% | 27.56%
2017-07-01 | SCUT-DLVClab-HuangGroup | 41.79% | 22.89% | 57.53% | 14.29%
2017-10-11 | TextFCN V2 | 28.34% | 30.11% | 43.31% | 23.08%
2017-06-22 | CNN-LSTM based text detection | 6.19% | 15.92% | 20.52% | 13.00%
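
The page does not define Hmean. It appears to be the harmonic mean of Recall and Precision (the usual F-measure), which is an assumption here, but one that reproduces every listed value to within rounding. A minimal Python sketch of that check, using the hypothetical helper name `hmean` and the figures from the table:

```python
def hmean(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision, both given in percent."""
    if recall + precision == 0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)


# (method, recall %, precision %, Hmean % as listed in the table)
rows = [
    ("Foo & Bar", 83.66, 3.08, 5.95),
    ("Tencent-DPPR Team & USTB-PRIR", 74.98, 27.56, 40.30),
    ("SCUT-DLVClab-HuangGroup", 57.53, 14.29, 22.89),
    ("TextFCN V2", 43.31, 23.08, 30.11),
    ("CNN-LSTM based text detection", 20.52, 13.00, 15.92),
]

for name, recall, precision, listed in rows:
    print(f"{name}: computed {hmean(recall, precision):.2f}%, listed {listed:.2f}%")
```

This also explains why Foo & Bar has the highest Average Precision and Recall but the lowest Hmean: the harmonic mean is dominated by the smaller of its two inputs, here the 3.08% precision.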

Ranking Graphic
