method: ClusterFlow (2023-03-28)

Authors: Anthony Sherbondy, Renshen Wang

Affiliation: Google


Description: ClusterFlow is designed to address the problem of extracting text from videos as posed by the RoadText-1K dataset. The main motivation is to demonstrate the utility of combining commodity algorithms for OCR, optical flow, clustering, and decision-tree classification.

First, we use a public cloud API to extract OCR results at line-level granularity on every frame (~300) of each video. Next, we use a modern RAFT implementation to compute a dense, pixel-level optical flow field for every frame. The optical flow field is then used to extrude the OCR line results temporally, creating tubes, or tracklets, of lines. Finally, an unsupervised clustering algorithm groups the line-text tracklets into clusters across the entire video. The distance metric between tracklets, the clustering algorithm, and its hyperparameters are tuned on the training dataset.
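As a rough illustration of the extrusion step, the sketch below propagates an OCR line box through precomputed per-frame flow fields (e.g. from RAFT) by averaging the flow vectors under the box. The function names and the mean-flow simplification are assumptions for illustration, not the submission's actual implementation, which works at the pixel level.

```python
import numpy as np

def propagate_box(box, flow):
    """Shift an axis-aligned OCR line box into the next frame using the
    mean optical flow inside the box (a crude stand-in for the dense
    per-pixel extrusion described above).

    box  -- (x0, y0, x1, y1) in pixel coordinates
    flow -- H x W x 2 dense flow field (dx, dy) for the current frame
    """
    x0, y0, x1, y1 = (int(round(v)) for v in box)
    region = flow[y0:y1, x0:x1]           # flow vectors under the box
    dx, dy = region.reshape(-1, 2).mean(axis=0)
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)

def build_tracklet(box, flows):
    """Extrude one OCR line detection through a sequence of per-frame
    flow fields, yielding a tube/tracklet of boxes."""
    tracklet = [box]
    for flow in flows:
        box = propagate_box(box, flow)
        tracklet.append(box)
    return tracklet
```

With a uniform flow of (2, 1) pixels per frame, a box simply translates by that amount each step, so the tracklet traces the text line's motion through the video.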

Given the clustered tracklets, the algorithm then selects geometry and text from each tracklet to create tracked lines that have at most a single appearance within any video frame. To do this, a set of features is generated from each line appearance, tracklet, and cluster and fed into a classification algorithm. The classifier is trained to select the appearances of a cluster that match the ground truth in the training set. At inference time, the classification probabilities are used to select among the possible line-text appearances within a cluster at any video frame.
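The selection step above reduces to picking, per cluster and per frame, the appearance the classifier scored highest. A minimal sketch, assuming a hypothetical per-appearance record schema (the submission does not specify its feature or data format):

```python
def select_appearances(appearances):
    """Keep at most one line appearance per (cluster, frame), choosing
    the one the classifier scored highest.

    Each appearance is a dict with 'cluster', 'frame', and 'prob' keys
    plus geometry/text payload -- a hypothetical schema for illustration.
    """
    best = {}
    for app in appearances:
        key = (app["cluster"], app["frame"])
        if key not in best or app["prob"] > best[key]["prob"]:
            best[key] = app
    return list(best.values())
```

This enforces the "at most a single appearance within any video frame" constraint: competing OCR readings of the same clustered line are resolved by classifier confidence.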

method: TH-DN (2023-03-28)

Authors: Ning Ding, Kemeng Zhao, Gang Yao, Pei Tang, Haodong Shi, Liangrui Peng

Affiliation: Tsinghua University


Description: The TH-DN method includes detection, tracking, and recognition modules. For detection, YOLOX [1] with a ResNet-50 backbone is employed. For multi-object tracking, ByteTrack [2] is used; it additionally keeps low-score detection boxes, using their similarity to existing tracklets to recover true objects and filter out background detections. For recognition, an encoder-decoder architecture is adopted: the backbone is a ResNet variant, the encoder is a bidirectional LSTM network, and the decoder is a Transformer module.
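The two-stage association that makes ByteTrack recover low-score boxes can be sketched as follows. This is a simplified greedy IoU variant for illustration only; the real method uses Kalman-filter motion prediction and Hungarian matching.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, dets, hi=0.6, iou_thr=0.3):
    """ByteTrack-style two-stage association (simplified).

    tracks -- {track_id: last_box}
    dets   -- [(box, score), ...]
    Stage 1 matches high-score detections to tracks; stage 2 lets the
    remaining tracks claim low-score detections instead of discarding
    them, which recovers occluded or blurred objects.
    """
    high = [d for d in dets if d[1] >= hi]
    low = [d for d in dets if d[1] < hi]
    matched, free = {}, dict(tracks)
    for pool in (high, low):
        for box, _ in pool:
            best_id, best_iou = None, iou_thr
            for tid, tbox in free.items():
                ov = iou(tbox, box)
                if ov > best_iou:
                    best_id, best_iou = tid, ov
            if best_id is not None:
                matched[best_id] = box
                del free[best_id]
    return matched
```

Low-score boxes that overlap no existing track are simply dropped, which is how background false positives are filtered out.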

[1] Ge Z, Liu S, Wang F, et al. YOLOX: Exceeding YOLO Series in 2021. arXiv preprint arXiv:2107.08430, 2021.
[2] Zhang Y, Sun P, Jiang Y, et al. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In ECCV, 2022: 1-21.

method: TencentOCR V5 (2023-03-28)

Authors: Haoxi Li, Weida Chen, Huiwen Shi, Sicong Liu, Fan Yang, Lifu Wang, Qingxiang Lin, Yuxin Wang, Mei Jiang, Jing Lv, Chunchao Guo, Hongfa Wang, Dapeng Tao, Wei Liu

Affiliation: TencentOCR

Description: We integrated the detection results of DBNet and Cascade Mask R-CNN built with multiple backbone architectures, combined them with the PARSeq English recognition model for recognition, and further improved end-to-end tracking with ByteTrack. As a result, we obtained end-to-end tracking and trajectory recognition results.
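Integrating two detectors' outputs typically means merging their box sets and suppressing near-duplicates. The sketch below uses score-ordered IoU suppression as one plausible fusion rule; the submission does not state its actual ensembling scheme, so this is an illustrative guess.

```python
def _iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def fuse_detections(boxes_a, boxes_b, iou_thr=0.5):
    """Merge detections from two detectors (e.g. one DBNet set and one
    Cascade Mask R-CNN set), keeping the higher-scored box wherever the
    two overlap heavily. Each detection is (x0, y0, x1, y1, score).
    """
    merged = sorted(boxes_a + boxes_b, key=lambda d: d[4], reverse=True)
    kept = []
    for det in merged:
        if all(_iou(det[:4], k[:4]) < iou_thr for k in kept):
            kept.append(det)
    return kept
```

The fused boxes would then be cropped and passed to the recognition model, and linked across frames by the tracker.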

Ranking Table

Date        Method          MOTA      MOTP    IDF1    Mostly Matched  Partially Matched  Mostly Lost
2023-03-28  TencentOCR V5   -15.3601  50.73%  15.58%  206             412                4053
2023-03-21  TencentOCR V1   -16.8418  55.96%  16.07%  243             447                4172
2023-03-27  TencentOCR V4   -28.7695  64.41%  24.79%  555             862                3566
2023-03-27  solar flare     -60.4763  60.87%   7.28%  209             651                4132
2023-03-24  RoadText DRTE   -61.3921  65.47%  12.08%  146             823                4020

Ranking Graphic
