method: TransDETR (2023-03-26)

Authors: Yu Hao, Chuhui Xue, Wenqing Zhang, Song Bai

Affiliation: ByteDance Inc.

Email: jinyu121@gmail.com

Description: The method we use is TransDETR [1]. First, we take the weights pre-trained on the ICDAR2015 video dataset, then fine-tune the network on RoadText3K and BOVText for 20 epochs. Finally, we fine-tune the network on RoadText for another 20 epochs.

[1] End-to-end Video Text Spotting with Transformer
[2] Read while you drive - multilingual text tracking on the road

method: ClusterFlow (2023-03-28)

Authors: Anthony Sherbondy, Renshen Wang

Affiliation: Google

Email: tonysherbondy@google.com

Description: ClusterFlow is designed specifically to address the problem of extracting text from videos as presented in the RoadText1k dataset. The main motivation is to demonstrate the utility of combining commodity algorithms for OCR, optical flow, clustering, and classification with decision trees.

First, we use a public cloud API to extract OCR results at line-level granularity on every image frame (~300) of each video. Next, we use a modern RAFT implementation to calculate a dense optical flow field at the pixel level for every image. The optical flow field is then used to extrude the OCR line results temporally to create tubes, or tracklets, of lines. Next, an unsupervised clustering algorithm groups the line text tracklets into clusters across the entire video. The distance metric between tracklets, the clustering algorithm, and its hyperparameters are selected by a search on the training dataset.
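The temporal extrusion step can be illustrated with a minimal sketch. The description does not say how the flow field moves a line box, so shifting the box by the median flow inside it is an assumption made here for illustration, as are the function names `propagate_box` and `iou` and the nested-list flow layout.

```python
import math
from statistics import median


def propagate_box(box, flow):
    """Shift an axis-aligned box (x0, y0, x1, y1) into the next frame.

    `flow[row][col]` holds the (dx, dy) displacement of that pixel.
    Assumption: the box moves rigidly by the median flow inside it.
    """
    x0, y0, x1, y1 = box
    height, width = len(flow), len(flow[0])
    # Clip the box to the frame and convert to integer pixel ranges.
    c0, c1 = max(0, math.floor(x0)), min(width, math.ceil(x1))
    r0, r1 = max(0, math.floor(y0)), min(height, math.ceil(y1))
    if c1 <= c0 or r1 <= r0:
        return box  # box lies outside the frame; leave it unchanged
    dxs = [flow[r][c][0] for r in range(r0, r1) for c in range(c0, c1)]
    dys = [flow[r][c][1] for r in range(r0, r1) for c in range(c0, c1)]
    dx, dy = median(dxs), median(dys)
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Toy example: a uniform flow of 3 px right and 2 px up on a 40x40 frame.
flow = [[(3.0, -2.0)] * 40 for _ in range(40)]
moved = propagate_box((10, 10, 20, 20), flow)
overlap = iou(moved, (13, 8, 23, 18))
```

In a tracklet builder along these lines, the propagated box would be compared (for example by IoU) with the OCR lines found in the next frame to decide whether they extend the same tracklet.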

Given the clustered tracklets, the algorithm then selects geometry and text from each tracklet to create tracked lines that have at most a single appearance within any video frame. To do this, a set of features is generated from each line appearance, tracklet, and cluster and fed into a classification algorithm. The classification algorithm is trained to select the appearances of the cluster that would match the ground truth in the training set. At inference time, the classification probabilities are used to select among the possible line text appearances within a cluster at any video frame.
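The inference-time selection can be sketched as keeping, for each cluster and frame, the appearance with the highest classifier probability. The dictionary keys (`cluster`, `frame`, `prob`, `text`) and the function name are hypothetical; the description does not specify the data layout or any tie-breaking rule.

```python
def select_appearances(appearances):
    """Keep at most one appearance per (cluster, frame) pair.

    `appearances` is an iterable of dicts with the hypothetical keys
    'cluster', 'frame', 'prob' (classifier probability) and 'text'.
    """
    best = {}
    for app in appearances:
        key = (app["cluster"], app["frame"])
        # Replace the stored candidate only if this one scores higher.
        if key not in best or app["prob"] > best[key]["prob"]:
            best[key] = app
    return best


candidates = [
    {"cluster": 0, "frame": 5, "prob": 0.40, "text": "ST0P"},
    {"cluster": 0, "frame": 5, "prob": 0.85, "text": "STOP"},
    {"cluster": 0, "frame": 6, "prob": 0.70, "text": "STOP"},
    {"cluster": 1, "frame": 5, "prob": 0.55, "text": "EXIT"},
]
selected = select_appearances(candidates)
```

Each cluster then contributes a single line per frame, which is the constraint the description places on the tracked output.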

method: TH-DL (2023-03-28)

Authors: Gang Yao*, Ning Ding*, Kemeng Zhao, Huan Yu, Pei Tang, Haodong Shi, Liangrui Peng [*equal contribution]

Affiliation: Tsinghua University

Email: dn22@mails.tsinghua.edu.cn

Description: The TH-DL method provides an integrated scheme for text detection, recognition, and tracking in driving videos. For text detection and recognition, the Transformer-based TESTR [1] is adopted. The pre-trained TESTR model is fine-tuned on the training set of the RoadText Challenge. For multi-object tracking, ByteTrack [2] is employed, which uses the similarity between low-score detection boxes and existing tracklets to recover true objects that would otherwise be discarded. A post-processing module is added to filter duplicate instances of text detection and recognition.

[1] Zhang X, Su Y, Tripathi S, et al. Text spotting transformers. CVPR, 2022: 9519-9528.

[2] Zhang Y, Sun P, Jiang Y, et al. ByteTrack: Multi-object tracking by associating every detection box. ECCV, 2022, LNCS, vol 13682: 1-21.
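The two-stage association idea behind ByteTrack, as summarized in the TH-DL description, can be sketched as follows. This is a simplification under stated assumptions: it uses greedy IoU matching on raw boxes instead of the motion prediction and optimal assignment used in the published tracker, and the thresholds, function names, and detection layout are illustrative only.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, dets, iou_thresh):
    """Greedily pair tracks {id: box} with detections, best IoU first."""
    pairs = sorted(
        ((iou(box, det["box"]), tid, di)
         for tid, box in tracks.items()
         for di, det in enumerate(dets)),
        reverse=True,
    )
    matches, used_tracks, used_dets = [], set(), set()
    for score, tid, di in pairs:
        if score < iou_thresh:
            break  # remaining pairs overlap too little
        if tid in used_tracks or di in used_dets:
            continue
        matches.append((tid, dets[di]))
        used_tracks.add(tid)
        used_dets.add(di)
    left_tracks = {t: b for t, b in tracks.items() if t not in used_tracks}
    left_dets = [d for i, d in enumerate(dets) if i not in used_dets]
    return matches, left_tracks, left_dets


def byte_associate(tracks, dets, high=0.6, low=0.1, iou_thresh=0.3):
    """Match high-score detections first, then low-score ones to leftovers."""
    high_dets = [d for d in dets if d["score"] >= high]
    low_dets = [d for d in dets if low <= d["score"] < high]
    first, remaining, new_candidates = greedy_match(tracks, high_dets, iou_thresh)
    # Low-score boxes may only extend existing tracks, never start new ones.
    second, lost, _ = greedy_match(remaining, low_dets, iou_thresh)
    return first + second, lost, new_candidates


tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
dets = [
    {"box": (1, 0, 11, 10), "score": 0.9},
    {"box": (50, 51, 60, 61), "score": 0.3},   # low score, overlaps track 2
    {"box": (100, 100, 110, 110), "score": 0.8},
]
matches, lost, new_candidates = byte_associate(tracks, dets)
```

In this toy frame, track 2 is kept alive by a low-score detection that a single-threshold tracker would have dropped, while the unmatched high-score box becomes a candidate for a new track.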

Ranking Table

Date        Method                    MOTA       MOTP    IDF1    Mostly Matched  Partially Matched  Mostly Lost
2023-03-26  TransDETR                 37.5297    74.18%  60.27%  1665            1762               1563
2023-03-28  ClusterFlow               36.0063    70.29%  61.19%  1757            1194               2029
2023-03-28  TH-DL                     31.0721    75.20%  62.35%  2180            1495               1317
2023-03-27  TencentOCR V4             22.2128    70.09%  52.02%  1217            1540               2226
2023-03-28  TH-DN                     22.0587    67.76%  47.18%  840             1038               3099
2023-03-28  roadtext-pingan           21.2187    74.63%  59.01%  2148            1282               1557
2023-03-28  roadtext-pingan           21.2187    74.63%  59.01%  2148            1282               1557
2023-03-20  roadText-pingan           18.8109    74.55%  56.84%  2216            1333               1419
2023-03-28  TencentOCR V5             17.9594    62.53%  36.13%  551             804                3316
2023-03-21  TencentOCR V1             17.3388    65.67%  36.72%  630             863                3369
2023-03-28  TencentOCR                16.3983    66.59%  42.58%  746             894                3231
2023-03-20  SCUT-MMOCR-KS             -10.2738   71.84%  56.91%  2354            1660               978
2023-03-24  RoadText DRTE             -27.6116   70.46%  17.42%  1083            1692               2214
2023-03-20  YBP                       -27.8428   75.40%  43.25%  1666            1505               1821
2023-03-20  YBP                       -27.8848   75.40%  43.21%  1666            1505               1821
2023-03-27  solar flare               -32.2401   69.20%  17.34%  571             1495               2926
2023-03-19  Road video text spotting  -152.1842  51.84%  17.47%  1391            1125               1677
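
Several entries have a negative MOTA. Under the standard CLEAR MOT definition, MOTA is one minus the ratio of total errors (misses, false positives, identity switches) to the number of ground-truth objects, so it falls below zero whenever the errors outnumber the ground truth. The sketch below assumes the leaderboard follows that standard definition and reports it in percent; the error counts in the example are invented for illustration and are not taken from any submission.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Standard CLEAR MOT accuracy, summed over all frames of a sequence.

    Returns a fraction; multiply by 100 for a percentage.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


# Invented counts: many false positives push the score below zero.
score = mota(false_negatives=300, false_positives=2100, id_switches=100, num_gt=1000)
percent = 100.0 * score
```

With these illustrative counts the errors are 2.5 times the ground truth, so the score is -1.5, or -150%. A method can therefore match many ground-truth tracks and still rank low if it emits a large number of spurious detections.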

Ranking Graphic
