Staff View: Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective

Bibliographic Details
Main Authors:	Yue, Zihao; Zhang, Liang; Jin, Qin
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2402.14545

_version_	1866909212541452288
author	Yue, Zihao; Zhang, Liang; Jin, Qin
author_facet	Yue, Zihao; Zhang, Liang; Jin, Qin
contents	Large Multimodal Models (LMMs) often suffer from multimodal hallucinations, wherein they may create content that is not present in the visual inputs. In this paper, we explore a new angle of this issue: overly detailed training data hinders the model's ability to timely terminate generation, leading to continued outputs beyond visual perception limits. By investigating how the model decides to terminate generation with EOS, the special end-of-sentence token, we find that the model assesses the completeness of the entire sequence by comparing the generated text with the image. This observation suggests that the model possesses an inherent potential of making proper EOS decisions based on its visual perception to avoid overly lengthy outputs. To take advantage of such potential, we explore two methods to mitigate multimodal hallucinations: a training objective that enables the model to reduce hallucinations by learning from regular instruction data, and a data filtering strategy to prevent harmful training data from exacerbating model hallucinations. Both methods significantly improve the hallucination performance of LMMs, without requiring any additional data or knowledge.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_14545
institution	arXiv
publishDate	2024
record_format	arxiv
title	Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective
topic	Computation and Language; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2402.14545
