Method: ClusterFlow - Task 1 - Text Detection and Tracking - RoadText Competition on Video Text Detection, Tracking and Recognition

method: ClusterFlow (2023-03-28)

Authors: Anthony Sherbondy, Renshen Wang

Affiliation: Google

Description: ClusterFlow is designed specifically to address the problem of extracting text from videos as presented in the RoadText1k dataset. The main motivation is to demonstrate the utility of combining commodity algorithms for OCR, optical flow, clustering, and decision-tree classification.

First, we use a public cloud API to extract OCR results at line-level granularity on every frame (~300) of each video. Next, we use a modern RAFT implementation to compute a dense, pixel-level optical flow field for every frame. The optical flow field is then used to extrude the OCR line results temporally, creating tubes, or tracklets, of lines. An unsupervised clustering algorithm then groups the line-text tracklets into clusters across the entire video. The distance metric between tracklets, the choice of clustering algorithm, and its hyperparameters are selected by a search on the training dataset.
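As a rough illustration of the temporal extrusion step, the sketch below moves each OCR line box into the next frame using the flow inside the box, then links it to the best-overlapping detection there. It is not the submitted implementation: the median-flow shift, the IoU matching rule, the `min_iou` threshold, and all function names are assumptions made for the example, and the flow field is a plain nested list standing in for RAFT output.

```python
from statistics import median


def shift_box(box, flow):
    """Move box (x0, y0, x1, y1) by the median flow vector inside it.

    flow[y][x] is a (dx, dy) pair: pixel motion from this frame to the next.
    Using the median is an assumption made for this sketch.
    """
    x0, y0, x1, y1 = box
    height, width = len(flow), len(flow[0])
    xs = range(max(0, int(x0)), min(width, int(x1) + 1))
    ys = range(max(0, int(y0)), min(height, int(y1) + 1))
    vectors = [flow[y][x] for y in ys for x in xs]
    if not vectors:
        return box  # box lies outside the image; leave it unchanged
    dx = median(v[0] for v in vectors)
    dy = median(v[1] for v in vectors)
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def build_tracklets(frames, flows, min_iou=0.5):
    """Greedily chain per-frame boxes into tracklets.

    frames[t] is the list of OCR line boxes in frame t.
    flows[t] is the flow field from frame t to frame t + 1.
    Returns a list of tracklets, each a list of (frame_index, box).
    """
    tracklets = [[(0, box)] for box in frames[0]]
    active = list(range(len(tracklets)))  # tracklets seen in the previous frame
    for t in range(1, len(frames)):
        unmatched = list(frames[t])
        next_active = []
        for idx in active:
            predicted = shift_box(tracklets[idx][-1][1], flows[t - 1])
            best = max(unmatched, key=lambda b: iou(predicted, b), default=None)
            if best is not None and iou(predicted, best) >= min_iou:
                tracklets[idx].append((t, best))
                unmatched.remove(best)
                next_active.append(idx)
        # Detections with no predecessor start new tracklets.
        for box in unmatched:
            tracklets.append([(t, box)])
            next_active.append(len(tracklets) - 1)
        active = next_active
    return tracklets
```

In this sketch a tracklet ends as soon as no detection overlaps its flow-predicted box. Broken tracklets of the same physical line are left for the later clustering stage to regroup.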

Given the clustered tracklets, the algorithm then selects geometry and text from the tracklets to create tracked lines that have at most one appearance in any video frame. To do this, a set of features is generated from each line appearance, tracklet, and cluster and fed into a classification algorithm. The classifier is trained to select the appearances in a cluster that match the ground truth in the training set. At inference, the classification probabilities are used to select among the candidate line-text appearances within a cluster at each video frame.
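The inference-time selection can be sketched as follows, assuming each candidate appearance already carries a cluster id, a frame index, and a classifier probability. The dictionary layout, the `select_appearances` name, and the optional `min_prob` cutoff are hypothetical and are not taken from the submission. The sketch only shows how keeping the highest-probability candidate per cluster and frame yields at most one appearance per tracked line per frame.

```python
def select_appearances(appearances, min_prob=0.0):
    """Keep the most probable appearance for each (cluster, frame) pair.

    Each appearance is a dict with keys "cluster", "frame", "prob", "text",
    and "box" (a hypothetical layout chosen for this sketch).
    Returns a dict mapping (cluster, frame) to the selected appearance.
    """
    best = {}
    for app in appearances:
        if app["prob"] < min_prob:
            continue  # optional cutoff; an assumption, not from the source
        key = (app["cluster"], app["frame"])
        if key not in best or app["prob"] > best[key]["prob"]:
            best[key] = app
    return best


candidates = [
    {"cluster": 0, "frame": 7, "prob": 0.35, "text": "ST0P", "box": (10, 10, 60, 30)},
    {"cluster": 0, "frame": 7, "prob": 0.91, "text": "STOP", "box": (11, 10, 61, 30)},
    {"cluster": 0, "frame": 8, "prob": 0.80, "text": "STOP", "box": (13, 10, 63, 30)},
    {"cluster": 1, "frame": 7, "prob": 0.10, "text": "EXIT", "box": (90, 40, 140, 60)},
]
selected = select_appearances(candidates, min_prob=0.2)
```

With these toy candidates, cluster 0 keeps one appearance in each of frames 7 and 8, and the low-probability candidate in cluster 1 is dropped by the cutoff.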