Method: TransDETR (2023-03-26)
Authors: Yu Hao, Chuhui Xue, Wenqing Zhang, Song Bai
Affiliation: ByteDance Inc.
Email: jinyu121@gmail.com
Description: The method we use is TransDETR [1]. First, we take the weights pre-trained on the ICDAR2015 video dataset, then fine-tune the network on RoadText3K and BOVText for 20 epochs. Finally, we fine-tune the network on RoadText for 20 epochs.
[1] End-to-end Video Text Spotting with Transformer
[2] Read while you drive - multilingual text tracking on the road
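The two-stage fine-tuning schedule in the description can be sketched as follows. This is a minimal illustration, not TransDETR's actual training code: `train_one_epoch` is a hypothetical placeholder for the real training step (optimizer, data loaders, losses).

```python
def staged_finetune(model, stages, train_one_epoch):
    """Fine-tune `model` through a sequence of (dataset, epochs) stages.

    A minimal sketch of the schedule in the description; the actual
    TransDETR training loop is abstracted into `train_one_epoch`.
    """
    history = []
    for dataset, epochs in stages:
        for _ in range(epochs):
            train_one_epoch(model, dataset)
            history.append(dataset)
    return history

# Schedule from the description; ICDAR2015-pretrained weights are
# assumed to be loaded into `model` beforehand.
schedule = [("RoadText3K+BOVText", 20), ("RoadText", 20)]
```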
Method: ClusterFlow (2023-03-28)
Authors: Anthony Sherbondy, Renshen Wang
Affiliation: Google
Email: tonysherbondy@google.com
Description: ClusterFlow is designed specifically to address the problem of extracting text from videos as presented in the RoadText1k dataset. The main motivation is to demonstrate the utility of combining commodity algorithms for OCR, optical flow, clustering, and decision-tree classification.
First, we use a public cloud API to extract OCR results at line-level granularity on every frame (~300 per video). Next, we use a modern RAFT implementation to compute a dense, pixel-level optical flow field for every frame. The flow field is then used to extrude the OCR line results temporally, creating tubes (tracklets) of lines. Next, an unsupervised clustering algorithm groups the line-text tracklets into clusters across the entire video. The distance metric between tracklets, the clustering algorithm, and its hyperparameters are searched on the training dataset.
Given the clustered tracklets, the algorithm selects geometry and text from each tracklet to create tracked lines that have at most a single appearance within any video frame. To do this, a set of features is generated from each line appearance, tracklet, and cluster, and fed into a classification algorithm. The classifier is trained to select the appearances of the cluster that match the ground truth in the training set. At inference, the classification probabilities are used to select among possible line-text appearances within a cluster at any video frame.
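The flow-based extrusion step and the tracklet distance metric can be illustrated with a minimal sketch. All names here are hypothetical, not ClusterFlow's actual code; the per-frame `(dx, dy)` displacements stand in for the mean of a dense RAFT flow field sampled inside each line's box.

```python
def propagate_box(box, flows):
    """Extrude an OCR line box through time.

    box:   (x1, y1, x2, y2) in the frame where the line was detected.
    flows: per-frame (dx, dy) displacements, standing in for the mean of
           a dense optical-flow field sampled inside the box.
    Returns the tracklet: one box per frame, starting with the original.
    """
    x1, y1, x2, y2 = box
    tracklet = [box]
    for dx, dy in flows:
        x1, y1, x2, y2 = x1 + dx, y1 + dy, x2 + dx, y2 + dy
        tracklet.append((x1, y1, x2, y2))
    return tracklet

def tracklet_distance(t1, t2):
    """Mean center distance over the frames two tracklets share -- one
    simple candidate for the clustering metric searched on the training set."""
    n = min(len(t1), len(t2))
    total = 0.0
    for (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) in zip(t1[:n], t2[:n]):
        acx, acy = (ax1 + ax2) / 2, (ay1 + ay2) / 2
        bcx, bcy = (bx1 + bx2) / 2, (by1 + by2) / 2
        total += ((acx - bcx) ** 2 + (acy - bcy) ** 2) ** 0.5
    return total / n
```

Tracklets whose pairwise distance falls under a searched threshold would then be grouped into one cluster per physical text line.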
Method: TH-DL (2023-03-28)
Authors: Gang Yao*, Ning Ding*, Kemeng Zhao, Huan Yu, Pei Tang, Haodong Shi, Liangrui Peng [*equal contribution]
Affiliation: Tsinghua University
Email: dn22@mails.tsinghua.edu.cn
Description: The TH-DL method provides an integrated scheme for text detection, recognition, and tracking in driving videos. For text detection and recognition, the Transformer-based TESTR [1] is adopted; the pre-trained TESTR model is fine-tuned on the training set of the RoadText Challenge. For multi-object tracking, ByteTrack [2] is employed, which uses similarities with tracklets to recover true objects from low-score detection boxes. A post-processing module filters duplicate text detection and recognition instances.
[1] Zhang X, Su Y, Tripathi S, et al. Text spotting transformers. CVPR, 2022: 9519-9528.
[2] Zhang Y, Sun P, Jiang Y, et al. ByteTrack: Multi-object tracking by associating every detection box. ECCV, 2022, LNCS, vol 13682: 1-21.
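The duplicate-filtering post-processing can be sketched as a greedy, score-ordered overlap suppression. This is a hypothetical stand-in, since TH-DL's actual filtering rules are not given here.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def filter_duplicates(detections, iou_thresh=0.7):
    """Keep only the highest-scoring instance among heavily overlapping
    text detections; `detections` is a list of {"box": ..., "score": ...}
    dicts (a hypothetical stand-in for TH-DL's post-processing module)."""
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(iou(det["box"], k["box"]) < iou_thresh for k in kept):
            kept.append(det)
    return kept
```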
Date | Method | MOTA | MOTP | IDF1 | Mostly Matched | Partially Matched | Mostly Lost
---|---|---|---|---|---|---|---
2023-03-26 | TransDETR | 37.5297 | 74.18% | 60.27% | 1665 | 1762 | 1563
2023-03-28 | ClusterFlow | 36.0063 | 70.29% | 61.19% | 1757 | 1194 | 2029
2023-03-28 | TH-DL | 31.0721 | 75.20% | 62.35% | 2180 | 1495 | 1317
2023-03-27 | TencentOCR V4 | 22.2128 | 70.09% | 52.02% | 1217 | 1540 | 2226
2023-03-28 | TH-DN | 22.0587 | 67.76% | 47.18% | 840 | 1038 | 3099
2023-03-28 | roadtext-pingan | 21.2187 | 74.63% | 59.01% | 2148 | 1282 | 1557
2023-03-20 | roadText-pingan | 18.8109 | 74.55% | 56.84% | 2216 | 1333 | 1419
2023-03-28 | TencentOCR V5 | 17.9594 | 62.53% | 36.13% | 551 | 804 | 3316
2023-03-21 | TencentOCR V1 | 17.3388 | 65.67% | 36.72% | 630 | 863 | 3369
2023-03-28 | TencentOCR | 16.3983 | 66.59% | 42.58% | 746 | 894 | 3231
2023-03-20 | SCUT-MMOCR-KS | -10.2738 | 71.84% | 56.91% | 2354 | 1660 | 978
2023-03-24 | RoadText DRTE | -27.6116 | 70.46% | 17.42% | 1083 | 1692 | 2214
2023-03-20 | YBP | -27.8428 | 75.40% | 43.25% | 1666 | 1505 | 1821
2023-03-20 | YBP | -27.8848 | 75.40% | 43.21% | 1666 | 1505 | 1821
2023-03-27 | solar flare | -32.2401 | 69.20% | 17.34% | 571 | 1495 | 2926
2023-03-19 | Road video text spotting | -152.1842 | 51.84% | 17.47% | 1391 | 1125 | 1677
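The MOTA column follows the standard CLEAR-MOT definition, which also explains the negative entries at the bottom of the table: accuracy drops below zero once false negatives, false positives, and identity switches together outnumber the ground-truth objects. A minimal sketch of the formula:

```python
def mota(num_gt, false_negatives, false_positives, id_switches):
    """CLEAR-MOT multi-object tracking accuracy, in percent:

        MOTA = (1 - (FN + FP + IDSW) / GT) * 100

    Unbounded below: heavy over- or under-detection can push it
    negative, as in the lower rows of the table above.
    """
    errors = false_negatives + false_positives + id_switches
    return (1.0 - errors / num_gt) * 100.0
```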