Bibliographic Details
Main Authors: Xia, Yuxi; Ulmer, Dennis; Blevins, Terra; Liu, Yihong; Schütze, Hinrich; Roth, Benjamin
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2601.08064
author Xia, Yuxi
Ulmer, Dennis
Blevins, Terra
Liu, Yihong
Schütze, Hinrich
Roth, Benjamin
contents Confidence estimation (CE) indicates how reliable the answers of large language models are, and it affects user trust and decision-making. Existing evaluations mainly concern the alignment between confidence and correctness but ignore the variability of language: confidence estimates should remain consistent under semantically equivalent prompts or answer variations, and should change when the meaning of the answer differs, as this may indicate a change in correctness. Therefore, we introduce a novel evaluation framework based on three complementary properties: robustness to prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers. We show that these metrics are largely independent of existing CE metrics, and that common CE methods often fail on them: while most methods achieve high robustness and stability, they struggle to distinguish semantically different answers, potentially because they do not effectively leverage generation-side information. Overall, our framework exposes overlooked limitations of current CE evaluations and provides guidance for selecting confidence estimators for real-world applications.
format Preprint
id arxiv_https___arxiv_org_abs_2601_08064
institution arXiv
publishDate 2026
record_format arxiv
title Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations
topic Computation and Language
url https://arxiv.org/abs/2601.08064