Overview - Comics Understanding

Comics, as a medium, uniquely combine text and images in styles often distinct from real-world visuals. Over the past three decades, computational research on comics has evolved from basic object detection to more sophisticated tasks. However, the field faces persistent challenges:

  • small datasets;
  • inconsistent annotations;
  • inaccessible model weights;
  • results that are not directly comparable due to varying train/test splits and metrics.

To address these issues, we aim to standardize annotations across datasets, broaden the variety of comic styles they cover, and establish benchmark results under clear, replicable settings. The Comics Datasets Framework [1] provides standardized detection annotations and conversion scripts for the images of existing datasets. Building on this, the recent CoMix dataset [2] adds multi-task annotations to existing manga and comics datasets, extending coverage to a balanced combination of both styles (see Figure 1).
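
As a concrete illustration of what such standardization involves, the sketch below maps a dataset-specific panel annotation into a COCO-style detection record. The actual schema and conversion scripts are defined by the Comics Datasets Framework [1]; the field names, file names, and COCO target format used here are assumptions for illustration only.

```python
# Illustrative sketch only: the real schema is defined by the
# Comics Datasets Framework [1]; field names below are assumptions.
import json

# Hypothetical source annotation from one of the original datasets,
# with panel boxes stored as [x1, y1, x2, y2] corners.
src = {"page": "page_012.jpg", "panels": [[10, 20, 300, 400]]}

def to_coco(src, image_id, category_id=1):
    """Map dataset-specific panel boxes to COCO-style detection records."""
    records = []
    for x1, y1, x2, y2 in src["panels"]:
        records.append({
            "image_id": image_id,
            "category_id": category_id,          # e.g. 1 = "panel"
            "bbox": [x1, y1, x2 - x1, y2 - y1],  # COCO uses [x, y, w, h]
            "area": (x2 - x1) * (y2 - y1),
            "iscrowd": 0,
        })
    return records

print(json.dumps(to_coco(src, image_id=12), indent=2))
```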

[Image: poster-2.png]
Figure 1. Composition of the CoMix benchmark. The top part of the figure provides a qualitative representation of the datasets included in CoMix. The accompanying bar charts depict the differences between the original annotations and those extended in CoMix. The left chart shows the increased number of annotations per dataset, whereas the right chart details the increase per task.

Moreover, despite recent advances in Vision and Language models [3,4] and in applications tailored to comics [5], evaluation metrics and datasets for comics often lag behind model development, remaining confined to small or single-style sets. The CoMix benchmark is designed to assess the multi-task capabilities of comic analysis models: it provides annotations for reading order, character naming, and dialog generation, and proposes a new metric for evaluating models on these tasks. The specifics of the multi-task CoMix benchmark are shown in Figure 2.
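
For the panel-text sorting (reading order) task, one standard way to compare a predicted order against the ground truth is Kendall's tau rank correlation. The sketch below uses it purely as an illustrative stand-in; the new metric actually proposed in [2] may differ.

```python
# Illustrative stand-in only: Kendall's tau is one common way to score a
# predicted reading order; the metric proposed in CoMix [2] may differ.
from scipy.stats import kendalltau

# Ground-truth and predicted reading orders of panel IDs on one page.
gt_order = ["p0", "p1", "p2", "p3", "p4"]
pred_order = ["p0", "p2", "p1", "p3", "p4"]

# Convert each ordering into a rank per panel, over the shared panel set.
rank_gt = {p: i for i, p in enumerate(gt_order)}
rank_pred = {p: i for i, p in enumerate(pred_order)}
panels = sorted(rank_gt)

tau, _ = kendalltau([rank_gt[p] for p in panels],
                    [rank_pred[p] for p in panels])
print(f"Kendall's tau: {tau:.3f}")  # 1.0 = perfect order, -1.0 = reversed
```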

[Image: poster.png]
Figure 2. The CoMix benchmark comprises four computational tasks (object detection, speaker identification, character re-identification, panel-text sorting) and two multi-modal reasoning tasks (character naming and dialog generation), which require models to detect objects and their relations as well as to read text. The figure shows the annotations added for each comic page; examples of annotations for the multi-modal reasoning tasks are depicted on the left.

The validation split is provided together with its annotations. The held-out test set is accessible through the Task page on the evaluation server.
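
Since the test annotations are withheld, a typical workflow is to score locally on the validation split before submitting test predictions. The sketch below illustrates one such local check for character naming; the per-page {page_id: [names]} format and exact-match scoring are assumptions for illustration, not the official CoMix protocol.

```python
# Hypothetical local check for character naming; the data format and
# exact-match scoring are assumptions, not the official CoMix protocol.
# In practice, gt/pred would be loaded from the released JSON annotations.
gt = {"page_01": ["Tintin", "Haddock"], "page_02": ["Snowy"]}
pred = {"page_01": ["Tintin", "Milou"], "page_02": ["Snowy"]}

correct, total = 0, 0
for page_id, names in gt.items():
    guesses = pred.get(page_id, [])
    # Compare position by position; missing predictions count as errors,
    # since they add to the total without adding to the correct count.
    for truth, guess in zip(names, guesses):
        correct += int(truth.strip().lower() == guess.strip().lower())
    total += len(names)

print(f"Character-naming exact-match accuracy: {correct / total:.2%}")
```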

[1]: Comics Datasets Framework: Mix of Comics Datasets for Benchmarking, 2024, link?

[2]: CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding, 2024, link?

[3]: GPT-4 Technical Report, 2023, arXiv.

[4]: MiniCPM-V 2.5, 2024, blog post.

[5]: The Manga Whisperer: Automatically Generating Transcriptions for Comics, CVPR 2024.

Challenge News

Important Dates