| Main Authors: | Xia, Yuxi; Ulmer, Dennis; Blevins, Terra; Liu, Yihong; Schütze, Hinrich; Roth, Benjamin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computation and Language |
| Online Access: | https://arxiv.org/abs/2601.08064 |
| _version_ | 1866916056999657472 |
|---|---|
| author | Xia, Yuxi; Ulmer, Dennis; Blevins, Terra; Liu, Yihong; Schütze, Hinrich; Roth, Benjamin |
| author_facet | Xia, Yuxi; Ulmer, Dennis; Blevins, Terra; Liu, Yihong; Schütze, Hinrich; Roth, Benjamin |
| contents | Confidence estimation (CE) indicates how reliable the answers of large language models are and impacts user trust and decision-making. Existing evaluations mainly concern the alignment between confidence and correctness, but ignore the variability of language: confidence estimates should remain consistent under semantically equivalent prompts or answer variations, while changing when answer meaning differs, as this may indicate a change in correctness. Therefore, we introduce a novel evaluation framework based on three complementary properties: robustness to prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers. We show that these metrics are largely independent from existing CE metrics, and that common CE methods often fail on them: while most methods achieve high robustness and stability, they struggle to distinguish semantically different answers, potentially because they do not effectively leverage generation-side information. Overall, our framework exposes overlooked limitations of current CE evaluations and provides guidance for selecting confidence estimators for real-world applications. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2601_08064 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2601.08064 |