Overview - Comics Understanding

Comics, as a medium, uniquely combine text and images in styles often distinct from real-world visuals. For the past three decades, computational research on comics has evolved from basic object detection to more sophisticated tasks. However, the field faces persistent challenges such as:

  • small datasets,
  • inconsistent annotations,
  • inaccessible model weights,
  • results that are not directly comparable due to varying train/test splits and metrics.

To address these issues, we aim to standardize annotations across datasets, introduce a variety of comic styles into the datasets, and establish benchmark results with clear, replicable settings. The Comics Datasets Framework [1] provides standardized detection annotations and conversion scripts for existing dataset images. Moreover, the recent CoMix dataset [2] adds multi-task annotations to existing manga and comics datasets, extending the coverage to a balanced combination of both styles (see Figure 1, left). The CoMix GitHub repository contains the code for both [1] and [2].

These works led to the definition of various tasks, collected under the NeurIPS 2024 CoMix Dataset and Framework and the ICDAR 2025 COMICS challenge (Challenge on Comics Understanding).


Figure 1. Composition of the CoMix benchmark: a qualitative representation of the datasets (top left), the differences between the original annotations and those extended in CoMix (bottom left), and an illustration of the annotations (right).

Moreover, despite recent advances in Vision and Language models [3, 4] and in applications tailored to comics [5, 6], evaluation metrics and datasets for comics often lag behind model development, remaining confined to small or single-style sets. The CoMix benchmark is designed to assess the multi-task capabilities of comic analysis models: it provides reading-order annotations, character naming, and dialog generation, and proposes a new metric for evaluating models on these tasks. The specifics of the multi-task CoMix benchmark are shown on the right side of Figure 1.

Building on this progress and on a recent survey on comics [7], which identifies the gap between the Vision-Language world and comics analysis, we have designed a set of sequence-processing tasks. These are the tasks included in the ICDAR 2025 competition, based on the ``pick a panel`` format: given a context and a set of candidate panels, the model must identify the correct choice. All tasks are framed as classification, with a multi-panel input (with text) and the correct option index as the answer.
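
As a concrete illustration of this framing, the sketch below shows what a single ``pick a panel`` instance could look like. It is a minimal, hypothetical example: the field names (``sample_id``, ``context``, ``options``, ``solution_index``) are illustrative and not the official dataset schema.

```python
# Minimal sketch of a single "pick a panel" instance (hypothetical field
# names, not the official schema). The model receives the context panels
# (with text) and the candidate panels, and answers with an option index.
sample = {
    "sample_id": "seq-0001",
    "context": ["panel_12.png", "panel_13.png", "panel_14.png"],  # ordered context panels
    "options": ["cand_a.png", "cand_b.png", "cand_c.png", "cand_d.png"],  # candidates
    "solution_index": 2,  # index of the correct candidate in "options"
}

def is_correct(predicted_index: int, sample: dict) -> bool:
    """A prediction is just the chosen option index."""
    return predicted_index == sample["solution_index"]

print(is_correct(2, sample))  # True
```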

Task 1: Pick-a-Panel (ICDAR 2025)

The challenge comprises a single task, named ``pick a panel``, through which we frame three different skills, briefly described here. For more detailed information, refer to the Tasks section:

  • Skill 1 - Sequence Filling: the model must choose the right panel among a set of options. The context is a sequence of panels with one panel missing at a specified location. The context length ranges from 3 to 7 panels, and the missing-panel index ranges from -1 to len(context): index -1 denotes the panel preceding the context, and index len(context) the panel following it (see the sketch after this list).
  • Skill 2 - Closure: the well-known closure task originally proposed for comics [8], with the text-cloze task reframed for (i) panel options as input and (ii) text options as input. As proposed in [9], both the context length and the set of options may vary.
  • Skill 3 - Caption Relevance: this skill inherits from the previous tasks but adds a layer of complexity: the context is given solely as a detailed textual description of the previous panel, and the model must select the panel that best follows from that description.
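
To make the Skill 1 index convention concrete, here is a minimal sketch of how a missing-panel index maps to a position relative to the context, together with the plain accuracy metric that a classification-over-indices framing implies. The helper names are hypothetical and this is not the official evaluation code.

```python
from typing import List

def describe_missing_slot(context: List[str], missing_index: int) -> str:
    """Map a Skill 1 missing-panel index to a position relative to the context.

    Hypothetical helper, following the convention stated above: -1 is the
    panel preceding the context, len(context) is the panel following it,
    and values in between denote a gap inside the sequence.
    """
    if missing_index == -1:
        return "the panel immediately preceding the context"
    if missing_index == len(context):
        return "the panel immediately following the context"
    return f"a gap inside the sequence, at position {missing_index}"

def accuracy(predictions: List[int], solutions: List[int]) -> float:
    """Since every task is classification over option indices, plain accuracy applies."""
    assert len(predictions) == len(solutions)
    return sum(p == s for p, s in zip(predictions, solutions)) / len(solutions)

context = ["p1.png", "p2.png", "p3.png", "p4.png"]  # a 4-panel context
print(describe_missing_slot(context, -1))  # preceding panel
print(describe_missing_slot(context, 4))   # following panel
print(accuracy([2, 0, 1], [2, 0, 3]))      # ~0.667
```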

Task 2: Multi-task Single-page (CoMix Benchmark)

For this task, please refer to the NeurIPS 2024 CoMix Benchmark (hosted at https://github.com/emanuelevivoli/CoMix), where instructions for gathering the data, the model weights, and the validation split (with annotations) are provided. The task will be progressively hosted here, and the held-out test set will soon be available through the task server. TBD: more to come in the coming months!

References:

  1. Vivoli et al., "Comics Datasets Framework: Mix of Comics Datasets for benchmarking", 2024, ICDAR 2024
  2. Vivoli et al., "CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding", 2024, NeurIPS 2024
  3. OpenAI, "GPT-4 Technical Report", 2023, arXiv
  4. OpenBMB, "MiniCPM-V 2.5", 2024, blog
  5. Sachdeva et al., "The Manga Whisperer: Automatically Generating Transcriptions for Comics", 2024, CVPR 2024
  6. Sachdeva et al., "Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names", 2024, ACCV 2024
  7. Vivoli et al., "One missing piece in Vision and Language: A Survey on Comics Understanding", 2024, under revision
  8. Iyyer et al., "The Amazing Mysteries of the Gutter: Drawing Inferences Between Panels in Comic Book Narratives", 2016, CVPR 2017
  9. Vivoli et al., "Multimodal Transformer for Comics Text-Cloze", 2024, ICDAR 2024

Challenge News

Important Dates

ICDAR 2025 Edition

17-21/09/2025: Results presentation

30/04/2025: Camera-ready of competition report

20/04/2025: Initial submission of competition report

15/04/2025: Deadline for Competition submissions

25/02/2025: Benchmark model and Dev sets v0.1

19/02/2025: Test/Val sets v0.1 available

10/01/2025: Tasks have been updated for the ICDAR 2025 COMICS competition at https://www.icdar2025.com/program/competitions

 

CoMix Benchmark

November 2024: The dataset & evaluation repository CoMix (https://github.com/emanuelevivoli/CoMix) has been released