Internformat: :: Library Catalog

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Xia, Yuxuan, Wang, Siheng, Li, Peng
Format:	Preprint
Veröffentlicht:	2026
Schlagworte:	Computer Vision and Pattern Recognition Artificial Intelligence
Online-Zugang:	https://arxiv.org/abs/2601.03500
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

_version_	1866914238047453184
author	Xia, Yuxuan Wang, Siheng Li, Peng
author_facet	Xia, Yuxuan Wang, Siheng Li, Peng
contents	Large Vision-Language Models (LVLMs) demonstrate significant progress in multimodal understanding and reasoning, yet object hallucination remains a critical challenge. While existing research focuses on mitigating language priors or high-level statistical biases, they often overlook the internal complexities of the visual encoding process. We identify that visual statistical bias, arising from the inherent Bag-of-Patches behavior of Vision Encoders under weak structural supervision, acts as a contributing factor of object hallucinations. Under this bias, models prioritize local texture features within individual patches over holistic geometric structures. This tendency may induce spurious visual confidence and result in hallucinations. To address this, we introduce a training-free algorithm called Structure-Disrupted Contrastive Decoding (SDCD), which performs contrastive calibration of the output distribution by introducing a shuffled structure-disrupted view. By penalizing tokens that maintain high confidence under this structure-less view, SDCD effectively suppresses the texture-driven bias. Experimental results demonstrate that SDCD significantly mitigates hallucinations across multiple benchmarks and enhances the overall multimodal capabilities of LVLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_03500
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models Xia, Yuxuan Wang, Siheng Li, Peng Computer Vision and Pattern Recognition Artificial Intelligence Large Vision-Language Models (LVLMs) demonstrate significant progress in multimodal understanding and reasoning, yet object hallucination remains a critical challenge. While existing research focuses on mitigating language priors or high-level statistical biases, they often overlook the internal complexities of the visual encoding process. We identify that visual statistical bias, arising from the inherent Bag-of-Patches behavior of Vision Encoders under weak structural supervision, acts as a contributing factor of object hallucinations. Under this bias, models prioritize local texture features within individual patches over holistic geometric structures. This tendency may induce spurious visual confidence and result in hallucinations. To address this, we introduce a training-free algorithm called Structure-Disrupted Contrastive Decoding (SDCD), which performs contrastive calibration of the output distribution by introducing a shuffled structure-disrupted view. By penalizing tokens that maintain high confidence under this structure-less view, SDCD effectively suppresses the texture-driven bias. Experimental results demonstrate that SDCD significantly mitigates hallucinations across multiple benchmarks and enhances the overall multimodal capabilities of LVLMs.
title	SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models
topic	Computer Vision and Pattern Recognition Artificial Intelligence
url	https://arxiv.org/abs/2601.03500

Ähnliche Einträge