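Staff View: MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment :: Library Catalog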

Bibliographic Details
Main Authors:	Banerjee, Sagarika; Madi, Tangatar; Swaminathan, Advait; Anh, Nguyen Dao Minh; Garg, Shivank; Zhu, Kevin; Sharma, Vasu
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.18729

_version_	1866908846404927488
author	Banerjee, Sagarika; Madi, Tangatar; Swaminathan, Advait; Anh, Nguyen Dao Minh; Garg, Shivank; Zhu, Kevin; Sharma, Vasu
author_facet	Banerjee, Sagarika; Madi, Tangatar; Swaminathan, Advait; Anh, Nguyen Dao Minh; Garg, Shivank; Zhu, Kevin; Sharma, Vasu
contents	Fine-grained image-caption alignment is crucial for vision-language models (VLMs), especially in socially critical contexts such as identifying real-world risk scenarios or distinguishing cultural proxies, where correct interpretation hinges on subtle visual or linguistic clues and where minor misinterpretations can lead to significant real-world consequences. We present MiSCHiEF, a set of two benchmarking datasets based on a contrastive pair design in the domains of safety (MiS) and culture (MiC), and evaluate four VLMs on tasks requiring fine-grained differentiation of paired images and captions. In both datasets, each sample contains two minimally differing captions and corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario, while in MiC, they depict cultural proxies in two distinct cultural contexts. We find that models generally perform better at confirming the correct image-caption pair than rejecting incorrect ones. Additionally, models achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image, compared to the converse task. The results, overall, highlight persistent modality misalignment challenges in current VLMs, underscoring the difficulty of precise cross-modal grounding required for applications with subtle semantic and visual distinctions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_18729
institution	arXiv
publishDate	2026
record_format	arxiv
title	MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2602.18729

Similar Items