Authors: Yukun Zhai
Description: We only use OOV datasets to train our model.
In the detection stage, we follow TBNet and Mask2Former as the base model with a multi-scale training strategy.
To combine the final detection results, we ensemble different detectors with different backbones and different testing sizes.
And in the recognition stage, We use a vision transformer model that consists of ViT encoder and query-based decoder to generate the recognition results in parallel.