method: ClusterFlow (2023-03-28)

Authors: Anthony Sherbondy, Renshen Wang

Affiliation: Google

Email: tonysherbondy@google.com

Description: ClusterFlow is designed specifically to address the problem of extracting text from videos as presented in the RoadText-1k dataset. The main motivation is to demonstrate the utility of combining commodity algorithms for OCR, optical flow, unsupervised clustering, and decision-tree classification.

First, we use a public cloud API to extract OCR results at line-level granularity on every image frame (~300 per video). Next, we use a modern RAFT implementation to compute a dense, pixel-level optical flow field for every frame. The optical flow field is then used to extrude the OCR line results temporally, creating tubes or tracklets of lines. Next, an unsupervised clustering algorithm groups the line-text tracklets into clusters across the entire video. The distance metric between tracklets, the choice of clustering algorithm, and its hyperparameters are searched on the training dataset.
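A minimal sketch of the extrusion step, propagating an OCR line box to the next frame via the flow field. The helper name and the choice of aggregating per-pixel flow by its median inside the box are assumptions for illustration, not the method's stated design.

```python
import numpy as np

def propagate_box(box, flow):
    """Shift an OCR line box to the next frame using a dense flow field.

    box: (x0, y0, x1, y1) in pixels; flow: HxWx2 array of per-pixel (dx, dy).
    The box is translated by the median flow inside it, extruding the
    detection temporally into a tracklet. (Median aggregation is an
    assumption; it is simply robust to flow outliers.)
    """
    x0, y0, x1, y1 = (int(round(v)) for v in box)
    patch = flow[y0:y1, x0:x1].reshape(-1, 2)
    dx, dy = np.median(patch, axis=0)
    return (float(box[0] + dx), float(box[1] + dy),
            float(box[2] + dx), float(box[3] + dy))

# Toy example: a uniform flow of (+5, -2) px over a 100x100 frame.
flow = np.zeros((100, 100, 2))
flow[..., 0] = 5.0
flow[..., 1] = -2.0
print(propagate_box((10, 20, 40, 30), flow))  # → (15.0, 18.0, 45.0, 28.0)
```

Chaining this frame-to-frame yields the line "tubes" that the clustering stage then groups across the video.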

Given the clustered tracklets, the algorithm then selects geometry and text from the tracklets to create tracked lines that have at most a single appearance within any video frame. To do this, a set of features is generated from each line appearance, tracklet, and cluster and fed into a classification algorithm. The classifier is trained to select the appearances of a cluster that match the ground truth in the training set. At inference time, the classification probabilities are used to select among candidate line-text appearances within a cluster at each video frame.
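The final selection step can be sketched as follows, assuming the classifier has already attached a match probability to each appearance. The field names (`frame`, `text`, `prob`) are hypothetical, chosen only for this illustration.

```python
def select_appearances(appearances):
    """Pick at most one line appearance per video frame from a cluster.

    appearances: list of dicts with hypothetical keys 'frame', 'text',
    and 'prob' (the classifier's match probability). Returns a mapping
    {frame: appearance}, keeping the highest-probability candidate so
    each frame shows the cluster's text at most once.
    """
    best = {}
    for app in appearances:
        f = app["frame"]
        if f not in best or app["prob"] > best[f]["prob"]:
            best[f] = app
    return best

# Two competing OCR readings in frame 0; the classifier favors "STOP".
cluster = [
    {"frame": 0, "text": "STOP", "prob": 0.9},
    {"frame": 0, "text": "ST0P", "prob": 0.4},
    {"frame": 1, "text": "STOP", "prob": 0.8},
]
picked = select_appearances(cluster)
print([picked[f]["text"] for f in sorted(picked)])  # → ['STOP', 'STOP']
```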

method: TH-DN (2023-03-28)

Authors: Ning Ding, Kemeng Zhao, Gang Yao, Pei Tang, Haodong Shi, Liangrui Peng

Affiliation: Tsinghua University

Email: dn22@mails.tsinghua.edu.cn

Description: The TH-DN method includes detection, tracking, and recognition modules. For detection, YOLOX [1] with a ResNet-50 backbone is employed. For multi-object tracking, ByteTrack [2] is used, with additional support for low-score detection boxes: their similarities with existing tracklets are used to recover true objects and filter out background detections. For recognition, an encoder-decoder architecture is adopted. The backbone is a variant of ResNet, the encoder is a bidirectional LSTM network, and the decoder is a Transformer module.
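The two-stage association idea behind ByteTrack can be sketched as follows: high-score detections are matched to tracklets first, then low-score boxes are matched against the still-unmatched tracklets to recover true objects. This greedy IoU matcher is a simplification (ByteTrack uses Kalman-predicted boxes and Hungarian assignment), and all names and boxes here are illustrative.

```python
def iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, dets, thr=0.3):
    """Greedy IoU matching; returns (matches, unmatched_track_indices)."""
    matches, free = [], list(range(len(tracks)))
    for d, box in enumerate(dets):
        if not free:
            break
        best = max(free, key=lambda t: iou(tracks[t], box))
        if iou(tracks[best], box) >= thr:
            matches.append((best, d))
            free.remove(best)
    return matches, free

# Stage 1: confident boxes; stage 2: a low-score box rescues track 1.
tracks = [(0, 0, 10, 10), (50, 50, 60, 60)]
high = [(1, 1, 11, 11)]
low = [(51, 49, 61, 59)]
m1, free = associate(tracks, high)
m2, _ = associate([tracks[t] for t in free], low)
print(m1, [(free[t], d) for t, d in m2])  # → [(0, 0)] [(1, 0)]
```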

[1] Zheng G, Liu S, Wang F, et al. YOLOX: Exceeding YOLO Series in 2021. arXiv preprint arXiv:2107.08430, 2021.
[2] Zhang Y, Sun P, Jiang Y, et al. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In ECCV, 2022: 1-21.

method: TencentOCR V5 (2023-03-28)

Authors: Haoxi Li, Weida Chen, Huiwen Shi, Sicong Liu, Fan Yang, Lifu Wang, Qingxiang Lin, Yuxin Wang, Mei Jiang, Jing Lv, Chunchao Guo, Hongfa Wang, Dapeng Tao, Wei Liu

Affiliation: TencentOCR

Description: We integrated the detection results of DBNet and Cascade Mask R-CNN built with multiple backbone architectures, combined them with the Parseq English recognition model for recognition, and further improved end-to-end tracking with ByteTrack. As a result, we obtained end-to-end tracking and trajectory recognition results.
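One simple way to integrate two detectors' outputs is a score-ordered, NMS-style merge that keeps a box unless it overlaps an already-kept, higher-scoring one. The sketch below is an assumption about how such an ensemble might be fused, not the team's actual procedure; all boxes and scores are hypothetical.

```python
def iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_detections(dets_a, dets_b, thr=0.5):
    """Union two detectors' outputs, collapsing near-duplicates by IoU.

    dets_*: lists of (score, x0, y0, x1, y1). Boxes are visited in
    descending score order, so the highest-scoring duplicate wins.
    """
    pool = sorted(dets_a + dets_b, key=lambda d: -d[0])
    kept = []
    for det in pool:
        if all(iou(det[1:], k[1:]) < thr for k in kept):
            kept.append(det)
    return kept

# Hypothetical boxes: one DBNet detection overlapping a Cascade one,
# plus a box only Cascade found; the merge keeps two boxes.
dbnet = [(0.9, 10, 10, 50, 20)]
cascade = [(0.8, 12, 11, 52, 21), (0.7, 100, 100, 140, 110)]
print(len(merge_detections(dbnet, cascade)))  # → 2
```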
Submission file: roadtext_text1_3_v5.json

Ranking Table

Date        Method          MOTA        MOTP     IDF1     Mostly Matched   Partially Matched   Mostly Lost
2023-03-28  ClusterFlow     11.0887     69.04%   48.07%   1392             920                 2668
2023-03-28  TH-DN           -4.5044     63.95%   31.65%   553              724                 3700
2023-03-28  TencentOCR V5   -15.3601    50.73%   15.58%   206              412                 4053
2023-03-21  TencentOCR V1   -16.8418    55.96%   16.07%   243              447                 4172
2023-03-28  TH-DL           -23.1028    72.83%   37.34%   1235             737                 3020
2023-03-28  TencentOCR      -23.8669    56.19%   19.71%   315              454                 4102
2023-03-26  TransDETR       -28.4962    68.74%   26.87%   660              741                 3589
2023-03-27  TencentOCR V4   -28.7695    64.41%   24.79%   555              862                 3566
2023-03-24  RoadText DRTE   -61.3921    65.47%   12.08%   146              823                 4020
2023-03-20  SCUT-MMOCR-KS   -77.1913    67.83%   29.96%   1196             918                 2878
2023-03-20  YBP             -102.1287   67.44%   11.64%   406              727                 3859
2023-03-20  YBP             -102.1382   67.44%   11.63%   406              727                 3859

Ranking Graphic
