method: Foo & Bar 2017-07-01

Authors: Zheqi He, Yongtao Wang

Description: An improvement of Faster R-CNN to meet the requirement of detecting quadrilateral objects such as text. The bounding box regression layer is replaced with a quadrangle regression layer, and the regression target and the loss function are modified accordingly. ResNet-152 is used as the base network. To combine ResNet-152 with Faster R-CNN, conv4_x and conv5_x are separated in ResNet-152, the downsampling of conv4_x is removed, and the region proposal network (RPN) and RoIPooling are inserted between them.
The network first processes the whole image to produce a convolutional feature map. This map is used as the input of the RPN to generate regions of interest (RoIs), each with an objectness score. These RoIs and the feature map generated by conv4_x are fed to the RoIPooling layer to obtain a fixed-size feature map for each RoI. This feature map is fed to several convolutional layers (conv5_x).
conv5_x and the layers after it play the role of the fully connected layers commonly seen in VGG networks: they compute a feature map for each RoI, and these feature maps are pooled by global average pooling (GAP). Finally, the output of GAP is fed into two sibling output layers: a classification layer that predicts the label of each RoI, and a quadrangle regression layer that outputs 8 real-valued numbers for each RoI. Each set of 8 values encodes the coordinates of the four vertices of the text region.
The method is implemented in TensorFlow. The detection network is pre-trained on ImageNet; no additional data was used.
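To make the 8-value quadrangle output concrete, the following is a minimal sketch of one way the four vertices could be encoded relative to an RoI and decoded again. The description does not give the exact regression target, so the parameterization here (vertex offsets from the RoI centre, scaled by RoI width and height) is an assumption, and the names `encode_quad` and `decode_quad` are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # RoI as (x_min, y_min, x_max, y_max)
Quad = Sequence[float]                   # (x1, y1, x2, y2, x3, y3, x4, y4)


def encode_quad(roi: Box, quad: Quad) -> List[float]:
    """Turn 4 vertices into 8 regression targets relative to the RoI.

    ASSUMPTION: targets are offsets from the RoI centre, divided by the
    RoI width (x) or height (y). The source does not specify this.
    """
    x_min, y_min, x_max, y_max = roi
    w, h = x_max - x_min, y_max - y_min
    cx, cy = x_min + 0.5 * w, y_min + 0.5 * h
    targets: List[float] = []
    for i in range(0, 8, 2):
        targets.append((quad[i] - cx) / w)
        targets.append((quad[i + 1] - cy) / h)
    return targets


def decode_quad(roi: Box, deltas: Quad) -> List[float]:
    """Invert encode_quad: map 8 predicted values back to image coordinates."""
    x_min, y_min, x_max, y_max = roi
    w, h = x_max - x_min, y_max - y_min
    cx, cy = x_min + 0.5 * w, y_min + 0.5 * h
    quad: List[float] = []
    for i in range(0, 8, 2):
        quad.append(deltas[i] * w + cx)
        quad.append(deltas[i + 1] * h + cy)
    return quad


# Hypothetical RoI and ground-truth text quadrangle, in pixels.
roi = (10.0, 20.0, 110.0, 70.0)
gt_quad = (15.0, 25.0, 105.0, 20.0, 110.0, 65.0, 12.0, 70.0)

targets = encode_quad(roi, gt_quad)
recovered = decode_quad(roi, targets)
```

Normalizing by the RoI size keeps the targets in a similar numeric range for large and small proposals, which is the same motivation as the box parameterization in Faster R-CNN. Unlike an axis-aligned box, the quadrangle has no width or height term, so all 8 values are treated as vertex offsets in this sketch.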