Staff View: Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations :: Library Catalog

Bibliographic Details
Main Authors:	Xia, Yuxi; Ulmer, Dennis; Blevins, Terra; Liu, Yihong; Schütze, Hinrich; Roth, Benjamin
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.08064

_version_	1866916056999657472
author	Xia, Yuxi; Ulmer, Dennis; Blevins, Terra; Liu, Yihong; Schütze, Hinrich; Roth, Benjamin
author_facet	Xia, Yuxi; Ulmer, Dennis; Blevins, Terra; Liu, Yihong; Schütze, Hinrich; Roth, Benjamin
contents	Confidence estimation (CE) indicates how reliable the answers of large language models are and impacts user trust and decision-making. Existing evaluations mainly concern the alignment between confidence and correctness, but ignore the variability of language: confidence estimates should remain consistent under semantically equivalent prompts or answer variations, while changing when answer meaning differs, as this may indicate a change in correctness. Therefore, we introduce a novel evaluation framework based on three complementary properties: robustness to prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers. We show that these metrics are largely independent from existing CE metrics, and that common CE methods often fail on them: while most methods achieve high robustness and stability, they struggle to distinguish semantically different answers, potentially because they do not effectively leverage generation-side information. Overall, our framework exposes overlooked limitations of current CE evaluations and provides guidance for selecting confidence estimators for real-world applications.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_08064
institution	arXiv
publishDate	2026
record_format	arxiv
title	Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations
topic	Computation and Language
url	https://arxiv.org/abs/2601.08064

Similar Items