Overview - Comics Understanding

Comics, as a medium, uniquely combine text and images in styles often distinct from real-world visuals. Over the past three decades, computational research on comics has evolved from basic object detection to more sophisticated tasks. However, the field faces persistent challenges:

  • small datasets;
  • inconsistent annotations;
  • inaccessible model weights;
  • results that are not directly comparable due to varying train/test splits and metrics.

To address these issues, we aim to standardize annotations across datasets, broaden the variety of comic styles they cover, and establish benchmark results under clear, replicable settings. The Comics Datasets Framework [1] provides standardized detection annotations and conversion scripts for the images of existing datasets. Building on this, the recent CoMix dataset [2] adds multi-task annotations to existing manga and comics datasets, extending coverage to a balanced combination of both styles (see Figure 1).
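
As a concrete illustration of what such standardization involves, the sketch below maps a dataset-specific panel annotation into a COCO-style detection record. The actual schema and conversion scripts are defined by the Comics Datasets Framework [1]; the field names, file names, and COCO target format used here are assumptions for illustration only.

```python
# Illustrative sketch only: the real schema is defined by the
# Comics Datasets Framework [1]; field names below are assumptions.
import json

# Hypothetical source annotation from one of the original datasets,
# with panel boxes stored as [x1, y1, x2, y2] corners.
src = {"page": "page_012.jpg", "panels": [[10, 20, 300, 400]]}

def to_coco(src, image_id, category_id=1):
    """Map dataset-specific panel boxes to COCO-style detection records."""
    records = []
    for x1, y1, x2, y2 in src["panels"]:
        records.append({
            "image_id": image_id,
            "category_id": category_id,          # e.g. 1 = "panel"
            "bbox": [x1, y1, x2 - x1, y2 - y1],  # COCO uses [x, y, w, h]
            "area": (x2 - x1) * (y2 - y1),
            "iscrowd": 0,
        })
    return records

print(json.dumps(to_coco(src, image_id=12), indent=2))
```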

[Image: poster-2.png]
Figure 1. Composition of the CoMix benchmark. The top part of the figure provides a qualitative representation of the datasets included in CoMix. The accompanying bar charts depict the differences between the original annotations and those extended in CoMix. The left chart shows the increased number of annotations per dataset, whereas the right chart details the increase per task.

Moreover, despite recent advances in Vision and Language models [3,4] and in applications tailored to comics [5], evaluation metrics and datasets for comics often lag behind model development, remaining confined to small or single-style sets. The CoMix benchmark is designed to assess the multi-task capabilities of comic analysis models: it provides annotations for reading order, character naming, and dialog generation, and proposes a new metric for evaluating models on these tasks. The specifics of the multi-task CoMix benchmark are shown in Figure 2.
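
For the panel-text sorting (reading order) task, one standard way to compare a predicted order against the ground truth is Kendall's tau rank correlation. The sketch below uses it purely as an illustrative stand-in; the new metric actually proposed in [2] may differ.

```python
# Illustrative stand-in only: Kendall's tau is one common way to score a
# predicted reading order; the metric proposed in CoMix [2] may differ.
from scipy.stats import kendalltau

# Ground-truth and predicted reading orders of panel IDs on one page.
gt_order = ["p0", "p1", "p2", "p3", "p4"]
pred_order = ["p0", "p2", "p1", "p3", "p4"]

# Convert each ordering into a rank per panel, over the shared panel set.
rank_gt = {p: i for i, p in enumerate(gt_order)}
rank_pred = {p: i for i, p in enumerate(pred_order)}
panels = sorted(rank_gt)

tau, _ = kendalltau([rank_gt[p] for p in panels],
                    [rank_pred[p] for p in panels])
print(f"Kendall's tau: {tau:.3f}")  # 1.0 = perfect order, -1.0 = reversed
```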

[Image: poster.png]
Figure 2. The CoMix benchmark comprises four computational tasks (object detection, speaker identification, character re-identification, panel-text sorting) and two multi-modal reasoning tasks (character naming and dialog generation), which require models to detect objects and their relations as well as to read text. The figure shows the annotations added for each comic page; examples of annotations for the multi-modal reasoning tasks are depicted on the left.

The validation split is provided together with its annotations. The held-out test set is accessible through the Task page on the evaluation server.
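
Since the test annotations are withheld, a typical workflow is to score locally on the validation split before submitting test predictions. The sketch below illustrates one such local check for character naming; the per-page {page_id: [names]} format and exact-match scoring are assumptions for illustration, not the official CoMix protocol.

```python
# Hypothetical local check for character naming; the data format and
# exact-match scoring are assumptions, not the official CoMix protocol.
# In practice, gt/pred would be loaded from the released JSON annotations.
gt = {"page_01": ["Tintin", "Haddock"], "page_02": ["Snowy"]}
pred = {"page_01": ["Tintin", "Milou"], "page_02": ["Snowy"]}

correct, total = 0, 0
for page_id, names in gt.items():
    guesses = pred.get(page_id, [])
    # Compare position by position; missing predictions count as errors,
    # since they add to the total without adding to the correct count.
    for truth, guess in zip(names, guesses):
        correct += int(truth.strip().lower() == guess.strip().lower())
    total += len(names)

print(f"Character-naming exact-match accuracy: {correct / total:.2%}")
```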

[1]: Comics Datasets Framework: Mix of Comics Datasets for Benchmarking, 2024, link?

[2]: CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding, 2024, link?

[3]: GPT-4 Technical Report, 2023, arXiv.

[4]: MiniCPM-V 2.5, 2024, blog post.

[5]: The Manga Whisperer: Automatically Generating Transcriptions for Comics, CVPR 2024.

Challenge News

Important Dates