:: Library Catalog

Bibliographic Details
Main Authors:	Balkır, Esma, Pernthaller, Alice, Basaldella, Marco, Hernández-Orallo, José, Collier, Nigel
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.13885

Similar Items

Multi-agent AI systems outperform human teams in creativity
by: Hu, Tiancheng, et al.
Published: (2026)

All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning
by: Zhang, Caiqi, et al.
Published: (2025)

Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents
by: Testini, Irene, et al.
Published: (2025)

LUQ: Long-text Uncertainty Quantification for LLMs
by: Zhang, Caiqi, et al.
Published: (2024)

PredictaBoard: Benchmarking LLM Score Predictability
by: Pacchiardi, Lorenzo, et al.
Published: (2025)

Handling Ontology Gaps in Semantic Parsing
by: Bacciu, Andrea, et al.
Published: (2024)

LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations
by: Zhang, Caiqi, et al.
Published: (2025)

Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs
by: Sivapiromrat, Sanhanat, et al.
Published: (2025)

Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers
by: Pacchiardi, Lorenzo, et al.
Published: (2024)

100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances
by: Pacchiardi, Lorenzo, et al.
Published: (2024)

Conversational Complexity for Assessing Risk in Large Language Models
by: Burden, John, et al.
Published: (2024)

ReasonGraph: Visualisation of Reasoning Paths
by: Li, Zongqian, et al.
Published: (2025)

Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
by: Hu, Tiancheng, et al.
Published: (2025)

Psychometric Personality Shaping Modulates Capabilities and Safety in Language Models
by: Fitz, Stephen, et al.
Published: (2025)

Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning
by: Huang, Xijie, et al.
Published: (2023)

Aligning with Logic: Measuring, Evaluating and Improving Logical Preference Consistency in Large Language Models
by: Liu, Yinhong, et al.
Published: (2024)

Conformity in Large Language Models
by: Zhu, Xiaochen, et al.
Published: (2024)

Leveraging LLM-Respondents for Item Evaluation: a Psychometric Analysis
by: Liu, Yunting, et al.
Published: (2024)

Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks
by: Song, Yiliang, et al.
Published: (2026)

Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces
by: Pathak, Manas, et al.
Published: (2026)

When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning
by: Dong, Yijiang River, et al.
Published: (2025)

Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
by: Liu, Yinhong, et al.
Published: (2024)

Confident RAG: Enhancing the Performance of LLMs for Mathematics Question Answering through Multi-Embedding and Confidence Scoring
by: Chen, Shiting, et al.
Published: (2025)

Rank-Then-Score: Enhancing Large Language Models for Automated Essay Scoring
by: Cai, Yida, et al.
Published: (2025)

PiCSAR: Probabilistic Confidence Selection And Ranking for Reasoning Chains
by: Leang, Joshua Ong Jun, et al.
Published: (2025)

Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation
by: Lin, Zhen, et al.
Published: (2024)

From Deception to Detection: The Dual Roles of Large Language Models in Fake News
by: Sallami, Dorsaf, et al.
Published: (2024)

Atomic Calibration of LLMs in Long-Form Generations
by: Zhang, Caiqi, et al.
Published: (2024)

Fewer Truncations Improve Language Modeling
by: Ding, Hantian, et al.
Published: (2024)

Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution's Characteristics
by: Flores, Lorenzo Jaime Yu, et al.
Published: (2025)

Learning to Substitute Words with Model-based Score Ranking
by: Liu, Hongye, et al.
Published: (2025)

Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
by: Kostić, Bogdan, et al.
Published: (2026)

Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses
by: Yu, Fangyi, et al.
Published: (2025)

PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims
by: Yoo, Yongmin, et al.
Published: (2025)

From Confidence to Collapse in LLM Factual Robustness
by: Fastowski, Alina, et al.
Published: (2025)

In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
by: Tang, Zeyu, et al.
Published: (2026)

LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
by: Kim, Eunsu, et al.
Published: (2024)

Visualizing Uncertainty in Translation Tasks: An Evaluation of LLM Performance and Confidence Metrics
by: Park, Jin Hyun, et al.
Published: (2025)

LoGU: Long-form Generation with Uncertainty Expressions
by: Yang, Ruihan, et al.
Published: (2024)

Improving Word Translation via Two-Stage Contrastive Learning
by: Li, Yaoyiran, et al.
Published: (2022)