Results - RoadText Competition on Video Text Detection, Tracking and Recognition

method: TransDETR (2023-03-26)

Authors: Yu Hao, Chuhui Xue, Wenqing Zhang, Song Bai

Affiliation: ByteDance Inc.

Description: We use TransDETR[1]. Starting from weights pre-trained on the ICDAR2015 video dataset, we fine-tune the network on RoadText3K and BOVText for 20 epochs, then fine-tune it on RoadText for a further 20 epochs.

[1] End-to-end Video Text Spotting with Transformer
[2] Read while you drive - multilingual text tracking on the road

method: ClusterFlow (2023-03-28)

Authors: Anthony Sherbondy, Renshen Wang

Affiliation: Google

Email: tonysherbondy@google.com

Description: ClusterFlow is designed specifically for extracting text from videos such as those in the RoadText1k dataset. The main motivation is to demonstrate the utility of combining commodity algorithms for OCR, optical flow, clustering, and decision-tree classification.

First, we use a public cloud API to extract OCR results at line-level granularity on every image frame (~300) of each video. Next, we use a modern RAFT implementation to calculate a dense, pixel-level optical flow field for every image. The optical flow field is then used to extrude the OCR line results temporally, creating tubes or tracklets of lines. An unsupervised clustering algorithm then groups the line text tracklets into clusters across the entire video. The distance metric between tracklets, the clustering algorithm, and its hyperparameters are searched on the training dataset.
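The temporal extrusion step can be illustrated with a minimal sketch. It assumes axis-aligned boxes and shifts each box by the median flow inside it; the function names `propagate_box` and `extrude` and the median heuristic are illustrative assumptions, not the authors' implementation, and the flow field here is synthetic rather than produced by RAFT.

```python
import numpy as np

def propagate_box(box, flow):
    """Shift a box (x0, y0, x1, y1) into the next frame.

    flow has shape (H, W, 2) with per-pixel (dx, dy) displacements.
    Using the median flow inside the box is an assumed heuristic.
    """
    h, w = flow.shape[:2]
    x0, y0, x1, y1 = box
    # Clamp to the image and guarantee a crop of at least one pixel.
    c0 = int(np.clip(np.floor(x0), 0, w - 1))
    r0 = int(np.clip(np.floor(y0), 0, h - 1))
    c1 = int(np.clip(np.ceil(x1), c0 + 1, w))
    r1 = int(np.clip(np.ceil(y1), r0 + 1, h))
    dx, dy = np.median(flow[r0:r1, c0:c1].reshape(-1, 2), axis=0)
    return tuple(float(v) for v in (x0 + dx, y0 + dy, x1 + dx, y1 + dy))

def extrude(box, flows):
    """Chain propagate_box over consecutive flow fields to form a tracklet."""
    tracklet = [tuple(float(v) for v in box)]
    for flow in flows:
        tracklet.append(propagate_box(tracklet[-1], flow))
    return tracklet

# Synthetic flow: every pixel moves 3 px right and 1 px up per frame.
flow = np.zeros((100, 200, 2), dtype=np.float32)
flow[..., 0] = 3.0
flow[..., 1] = -1.0
tracklet = extrude((10, 20, 50, 40), [flow, flow])
print(tracklet)
```

In a real pipeline the flow fields would come from the optical flow model, and rotated or quadrilateral line boxes would need per-corner propagation.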

Given the clustered tracklets, the algorithm then selects geometry and text from the tracklets to create tracked lines that appear at most once in any video frame. To do this, a set of features is generated from each line appearance, tracklet, and cluster and fed into a classification algorithm. The classifier is trained to select the appearances in each cluster that match the ground truth in the training set. At inference, the classification probabilities are used to select among the possible line text appearances within a cluster at any video frame.
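The inference-time selection can be sketched as follows, assuming each candidate appearance already carries a classifier probability. The dictionary keys (`cluster`, `frame`, `text`, `prob`) and the function name are hypothetical, and the classifier itself is not shown.

```python
from collections import defaultdict

def select_appearances(candidates):
    """Keep the most probable appearance per (cluster, frame).

    candidates: iterable of dicts with hypothetical keys
    'cluster', 'frame', 'text', and 'prob'.
    Returns {cluster: {frame: candidate}}.
    """
    best = defaultdict(dict)
    for cand in candidates:
        per_frame = best[cand["cluster"]]
        current = per_frame.get(cand["frame"])
        if current is None or cand["prob"] > current["prob"]:
            per_frame[cand["frame"]] = cand
    return {cluster: dict(frames) for cluster, frames in best.items()}

candidates = [
    {"cluster": 0, "frame": 5, "text": "STOP", "prob": 0.91},
    {"cluster": 0, "frame": 5, "text": "ST0P", "prob": 0.40},
    {"cluster": 0, "frame": 6, "text": "STOP", "prob": 0.75},
    {"cluster": 1, "frame": 5, "text": "EXIT", "prob": 0.66},
]
selected = select_appearances(candidates)
print(selected[0][5]["text"])
```

This guarantees that a tracked line has at most one appearance per frame, which is the constraint the description states.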

method: TH-DL (2023-03-28)

Authors: Gang Yao*, Ning Ding*, Kemeng Zhao, Huan Yu, Pei Tang, Haodong Shi, Liangrui Peng [*equal contribution]

Affiliation: Tsinghua University

Email: dn22@mails.tsinghua.edu.cn

Description: The TH-DL method provides an integrated scheme for text detection, recognition, and tracking in driving videos. For text detection and recognition, the Transformer-based TESTR[1] is adopted. The pre-trained TESTR model is fine-tuned on the training set of the RoadText Challenge. For multi-object tracking, ByteTrack[2] is employed, which uses similarity with existing tracklets to recover true objects from low-score detection boxes. A post-processing module is added to filter duplicate text detection and recognition instances.

[1] Zhang X, Su Y, Tripathi S, et al. Text spotting transformers. CVPR, 2022: 9519-9528.

[2] Zhang Y, Sun P, Jiang Y, et al. ByteTrack: Multi-object tracking by associating every detection box. ECCV, 2022, LNCS, vol 13682: 1-21.
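The idea of recovering objects from low-score boxes can be illustrated with a simplified two-stage association. This is a sketch only: it uses greedy IoU matching instead of the Kalman-filter prediction and Hungarian assignment of the original ByteTrack, omits track creation and deletion, and the thresholds `high`, `low`, and `min_iou` are assumed values rather than the TH-DL settings.

```python
def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_match(tracks, dets, min_iou):
    """Match tracks {id: box} to dets [(box, score)] by descending IoU."""
    pairs = sorted(
        ((iou(box, det[0]), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(dets)),
        reverse=True,
    )
    matches, used = {}, set()
    for overlap, tid, j in pairs:
        if overlap < min_iou:
            break  # all remaining pairs overlap even less
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    return matches

def associate(tracks, dets, high=0.6, low=0.1, min_iou=0.3):
    """Stage 1 uses high-score boxes; stage 2 offers low-score boxes
    only to the tracks that stage 1 left unmatched."""
    high_dets = [d for d in dets if d[1] >= high]
    low_dets = [d for d in dets if low <= d[1] < high]
    first = greedy_match(tracks, high_dets, min_iou)
    remaining = {tid: box for tid, box in tracks.items() if tid not in first}
    second = greedy_match(remaining, low_dets, min_iou)
    assigned = {tid: high_dets[j][0] for tid, j in first.items()}
    assigned.update({tid: low_dets[j][0] for tid, j in second.items()})
    return assigned

tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
dets = [((1, 0, 11, 10), 0.9), ((51, 50, 61, 60), 0.3), ((200, 200, 210, 210), 0.2)]
assigned = associate(tracks, dets)
print(assigned)
```

Here track 2 is continued by a detection scoring only 0.3, which a single high-threshold tracker would have discarded, while the isolated low-score box at (200, 200) is ignored.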

Ranking Table


Date	Method	MOTA	MOTP	IDF1	Mostly Matched	Partially Matched	Mostly Lost
2023-03-26	TransDETR	37.5297	74.18%	60.27%	1665	1762	1563
2023-03-28	ClusterFlow	36.0063	70.29%	61.19%	1757	1194	2029
2023-03-28	TH-DL	31.0721	75.20%	62.35%	2180	1495	1317
2023-03-27	TencentOCR V4	22.2128	70.09%	52.02%	1217	1540	2226
2023-03-28	TH-DN	22.0587	67.76%	47.18%	840	1038	3099
2023-03-28	roadtext-pingan	21.2187	74.63%	59.01%	2148	1282	1557
2023-03-28	roadtext-pingan	21.2187	74.63%	59.01%	2148	1282	1557
2023-03-20	roadText-pingan	18.8109	74.55%	56.84%	2216	1333	1419
2023-03-28	TencentOCR V5	17.9594	62.53%	36.13%	551	804	3316
2023-03-21	TencentOCR V1	17.3388	65.67%	36.72%	630	863	3369
2023-03-28	TencentOCR	16.3983	66.59%	42.58%	746	894	3231
2023-03-20	SCUT-MMOCR-KS	-10.2738	71.84%	56.91%	2354	1660	978
2023-03-24	RoadText DRTE	-27.6116	70.46%	17.42%	1083	1692	2214
2023-03-20	YBP	-27.8428	75.40%	43.25%	1666	1505	1821
2023-03-20	YBP	-27.8848	75.40%	43.21%	1666	1505	1821
2023-03-27	solar flare	-32.2401	69.20%	17.34%	571	1495	2926
2023-03-19	Road video text spotting	-152.1842	51.84%	17.47%	1391	1125	1677
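Several entries have negative MOTA. Under the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT, so the score drops below zero whenever the total number of errors exceeds the number of ground-truth objects. The sketch below illustrates this with made-up counts; it assumes the table reports MOTA as a percentage and does not reproduce the competition's evaluation code.

```python
def mota(num_gt, false_negatives, false_positives, id_switches):
    """CLEAR MOT accuracy as a fraction; can be negative."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt

# Hypothetical counts: many false positives push the score below zero.
score = mota(num_gt=1000, false_negatives=400, false_positives=800, id_switches=50)
print(f"{100 * score:.1f}%")
```

This is why a method can match many tracks (a high "Mostly Matched" count) and still rank low: false positives and identity switches are penalized in MOTA but not in the matched-track counts.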
