MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Shu, Dong; Zhao, Haiyan; Hu, Jingyu; Liu, Weiru; Payani, Ali; Cheng, Lu; Du, Mengnan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2501.01346

_version_	1866909800740159488
author	Shu, Dong; Zhao, Haiyan; Hu, Jingyu; Liu, Weiru; Payani, Ali; Cheng, Lu; Du, Mengnan
author_facet	Shu, Dong; Zhao, Haiyan; Hu, Jingyu; Liu, Weiru; Payani, Ali; Cheng, Lu; Du, Mengnan
contents	Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in processing both visual and textual information. However, the critical challenge of alignment between visual and textual representations is not fully understood. This survey presents a comprehensive examination of alignment and misalignment in LVLMs through an explainability lens. We first examine the fundamentals of alignment, exploring its representational and behavioral aspects, training methodologies, and theoretical foundations. We then analyze misalignment phenomena across three semantic levels: object, attribute, and relational misalignment. Our investigation reveals that misalignment emerges from challenges at multiple levels: the data level, the model level, and the inference level. We provide a comprehensive review of existing mitigation strategies, categorizing them into parameter-frozen and parameter-tuning approaches. Finally, we outline promising future research directions, emphasizing the need for standardized evaluation protocols and in-depth explainability studies.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_01346
institution	arXiv
publishDate	2025
record_format	arxiv
title	Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability
topic	Computer Vision and Pattern Recognition; Computation and Language
url	https://arxiv.org/abs/2501.01346
