Bibliographic Details
Main Authors: Yue, Zihao, Zhang, Liang, Jin, Qin
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2402.14545
author Yue, Zihao
Zhang, Liang
Jin, Qin
contents Large Multimodal Models (LMMs) often suffer from multimodal hallucinations, in which they generate content that is not present in the visual inputs. In this paper, we explore a new angle on this issue: overly detailed training data hinders the model's ability to terminate generation in a timely manner, leading to continued outputs beyond the limits of its visual perception. By investigating how the model decides to terminate generation with EOS, the special end-of-sentence token, we find that the model assesses the completeness of the entire sequence by comparing the generated text with the image. This observation suggests that the model has an inherent potential to make proper EOS decisions based on its visual perception and so avoid overly lengthy outputs. To take advantage of this potential, we explore two methods to mitigate multimodal hallucinations: a training objective that enables the model to reduce hallucinations by learning from regular instruction data, and a data filtering strategy that prevents harmful training data from exacerbating model hallucinations. Both methods significantly improve the performance of LMMs on hallucination, without requiring any additional data or knowledge.
format Preprint
id arxiv_https___arxiv_org_abs_2402_14545
institution arXiv
publishDate 2024
record_format arxiv
title Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective
topic Computation and Language
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2402.14545