Internformat: :: Library Catalog

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Chen, Boqi, Liu, Xudong, Ao, Yunke, Qiu, Jianing
Format:	Preprint
Veröffentlicht:	2026
Schlagworte:	Computation and Language
Online-Zugang:	https://arxiv.org/abs/2604.23443
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

_version_	1866910166418456576
author	Chen, Boqi Liu, Xudong Ao, Yunke Qiu, Jianing
author_facet	Chen, Boqi Liu, Xudong Ao, Yunke Qiu, Jianing
contents	Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive the sufficient conditions for greedy decoding optimality. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLMs decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_23443
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective Chen, Boqi Liu, Xudong Ao, Yunke Qiu, Jianing Computation and Language Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive the sufficient conditions for greedy decoding optimality. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLMs decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.
title	Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective
topic	Computation and Language
url	https://arxiv.org/abs/2604.23443

Ähnliche Einträge