

Bibliographic Details
Main Authors:	Ryan, Yuriel; Ip, Hei Man; Kuek, Adriel; Liang, Paul Pu; Lee, Roy Ka-Wei
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2605.08145

_version_	1866918530915500032
author	Ryan, Yuriel; Ip, Hei Man; Kuek, Adriel; Liang, Paul Pu; Lee, Roy Ka-Wei
author_facet	Ryan, Yuriel; Ip, Hei Man; Kuek, Adriel; Liang, Paul Pu; Lee, Roy Ka-Wei
contents	Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that these issues can be addressed by exploiting the shared information between modalities to compensate for the impaired one. To this end, we analyze multimodal interactions -- redundant (shared), unique (exclusive), and synergistic (emergent) task-relevant information provided by the modalities -- to determine their impacts on model reliability. Specifically, amplifying redundant interactions would increase this exploitable shared information to resolve these issues; yet, modern instruction datasets often eliminate redundancies to prioritize visual grounding. We bridge this gap through a self-captioning workflow featuring a Multimodal Interaction Gate: a mechanism to convert unique interactions into redundant interactions. Our findings suggest that increasing redundancy can reduce visual induced errors by 38.3% and improve consistency by 16.8%.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_08145
institution	arXiv
publishDate	2026
record_format	arxiv
title	Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2605.08145
