| Main Authors: | Balkır, Esma, Pernthaller, Alice, Basaldella, Marco, Hernández-Orallo, José, Collier, Nigel |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.13885 |
Similar Items
Multi-agent AI systems outperform human teams in creativity
by: Hu, Tiancheng, et al.
Published: (2026)
All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning
by: Zhang, Caiqi, et al.
Published: (2025)
Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents
by: Testini, Irene, et al.
Published: (2025)
LUQ: Long-text Uncertainty Quantification for LLMs
by: Zhang, Caiqi, et al.
Published: (2024)
PredictaBoard: Benchmarking LLM Score Predictability
by: Pacchiardi, Lorenzo, et al.
Published: (2025)
Handling Ontology Gaps in Semantic Parsing
by: Bacciu, Andrea, et al.
Published: (2024)
LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations
by: Zhang, Caiqi, et al.
Published: (2025)
Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs
by: Sivapiromrat, Sanhanat, et al.
Published: (2025)
Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers
by: Pacchiardi, Lorenzo, et al.
Published: (2024)
100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances
by: Pacchiardi, Lorenzo, et al.
Published: (2024)
Conversational Complexity for Assessing Risk in Large Language Models
by: Burden, John, et al.
Published: (2024)
ReasonGraph: Visualisation of Reasoning Paths
by: Li, Zongqian, et al.
Published: (2025)
Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
by: Hu, Tiancheng, et al.
Published: (2025)
Psychometric Personality Shaping Modulates Capabilities and Safety in Language Models
by: Fitz, Stephen, et al.
Published: (2025)
Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning
by: Huang, Xijie, et al.
Published: (2023)
Aligning with Logic: Measuring, Evaluating and Improving Logical Preference Consistency in Large Language Models
by: Liu, Yinhong, et al.
Published: (2024)
Conformity in Large Language Models
by: Zhu, Xiaochen, et al.
Published: (2024)
Leveraging LLM-Respondents for Item Evaluation: a Psychometric Analysis
by: Liu, Yunting, et al.
Published: (2024)
Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks
by: Song, Yiliang, et al.
Published: (2026)
Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces
by: Pathak, Manas, et al.
Published: (2026)
When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning
by: Dong, Yijiang River, et al.
Published: (2025)
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
by: Liu, Yinhong, et al.
Published: (2024)
Confident RAG: Enhancing the Performance of LLMs for Mathematics Question Answering through Multi-Embedding and Confidence Scoring
by: Chen, Shiting, et al.
Published: (2025)
Rank-Then-Score: Enhancing Large Language Models for Automated Essay Scoring
by: Cai, Yida, et al.
Published: (2025)
PiCSAR: Probabilistic Confidence Selection And Ranking for Reasoning Chains
by: Leang, Joshua Ong Jun, et al.
Published: (2025)
Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation
by: Lin, Zhen, et al.
Published: (2024)
From Deception to Detection: The Dual Roles of Large Language Models in Fake News
by: Sallami, Dorsaf, et al.
Published: (2024)
Atomic Calibration of LLMs in Long-Form Generations
by: Zhang, Caiqi, et al.
Published: (2024)
Fewer Truncations Improve Language Modeling
by: Ding, Hantian, et al.
Published: (2024)
Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution's Characteristics
by: Flores, Lorenzo Jaime Yu, et al.
Published: (2025)
Learning to Substitute Words with Model-based Score Ranking
by: Liu, Hongye, et al.
Published: (2025)
Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
by: Kostić, Bogdan, et al.
Published: (2026)
Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses
by: Yu, Fangyi, et al.
Published: (2025)
PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims
by: Yoo, Yongmin, et al.
Published: (2025)
From Confidence to Collapse in LLM Factual Robustness
by: Fastowski, Alina, et al.
Published: (2025)
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
by: Tang, Zeyu, et al.
Published: (2026)
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
by: Kim, Eunsu, et al.
Published: (2024)
Visualizing Uncertainty in Translation Tasks: An Evaluation of LLM Performance and Confidence Metrics
by: Park, Jin Hyun, et al.
Published: (2025)
LoGU: Long-form Generation with Uncertainty Expressions
by: Yang, Ruihan, et al.
Published: (2024)
Improving Word Translation via Two-Stage Contrastive Learning
by: Li, Yaoyiran, et al.
Published: (2022)