Bibliographic Details
Main Authors: Testa, Davide, Bonetta, Giovanni, Bernardi, Raffaella, Bondielli, Alessandro, Lenci, Alessandro, Miaschi, Alessio, Passaro, Lucia, Magnini, Bernardo
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2502.16989
author Testa, Davide
Bonetta, Giovanni
Bernardi, Raffaella
Bondielli, Alessandro
Lenci, Alessandro
Miaschi, Alessio
Passaro, Lucia
Magnini, Bernardo
contents We introduce MAIA (Multimodal AI Assessment), a native-Italian benchmark designed for fine-grained investigation of the reasoning abilities of Vision Language Models (VLMs) on videos. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of the videos. MAIA evaluates VLMs on two aligned tasks, a visual statement verification task and an open-ended visual question-answering task, both on the same set of video-related questions. It considers twelve reasoning categories that aim to disentangle language and vision relations by highlighting the role of the visual input. Thanks to its carefully thought-out design, it evaluates VLMs' consistency and their visually grounded natural language comprehension and generation simultaneously through an aggregated metric; the resulting low scores highlight the models' fragility. Finally, the video collection has been carefully selected to reflect Italian culture, and the language data are produced by native speakers.
format Preprint
id arxiv_https___arxiv_org_abs_2502_16989
institution arXiv
publishDate 2025
record_format arxiv
title All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark
topic Computation and Language
url https://arxiv.org/abs/2502.16989