:: Library Catalog

Bibliographic Details
Main Authors:	Bianchi, Federico; Kwon, Yongchan; Izzo, Zachary; Zhang, Linjun; Zou, James
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2512.05925

Similar Items

Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
by: Bianchi, Federico, et al.
Published: (2024)

Can AI Be as Creative as Humans?
by: Wang, Haonan, et al.
Published: (2024)

Voice "Cloning" is Style Transfer
by: Zhou, Kaitlyn, et al.
Published: (2026)

ReasonIF: Large Reasoning Models Fail to Follow Instructions During Reasoning
by: Kwon, Yongchan, et al.
Published: (2025)

Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers
by: Kleidermacher, Hannah Calzi, et al.
Published: (2025)

"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
by: Zhou, Kaitlyn, et al.
Published: (2026)

What LLMs Think When You Don't Tell Them What to Think About?
by: Kwon, Yongchan, et al.
Published: (2026)

The Role of Ambiguity in Error Prediction via Uncertainty Quantification
by: Staliūnaitė, Ieva Raminta, et al.
Published: (2026)

Automated Benchmark Auditing for AI Agents and Large Language Models
by: Wang, Junlin, et al.
Published: (2026)

Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents
by: Miao, Jiacheng, et al.
Published: (2025)

Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
by: Kim, Dongjun, et al.
Published: (2025)

Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models
by: Liu, Yi, et al.
Published: (2025)

How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis
by: Bianchi, Federico, et al.
Published: (2024)

TextGrad: Automatic "Differentiation" via Text
by: Yuksekgonul, Mert, et al.
Published: (2024)

FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees
by: Nie, Fan, et al.
Published: (2024)

Understanding Impact of Human Feedback via Influence Functions
by: Min, Taywon, et al.
Published: (2025)

Spot the BlindSpots: Systematic Identification and Quantification of Fine-Grained LLM Biases in Contact Center Summaries
by: Mayilvaghanan, Kawin, et al.
Published: (2025)

Can LLMs Truly Embody Human Personality? Analyzing AI and Human Behavior Alignment in Dispute Resolution
by: Kwon, Deuksin, et al.
Published: (2026)

Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment
by: Kwon, Jea, et al.
Published: (2025)

Belief in the Machine: Investigating Epistemological Blind Spots of Language Models
by: Suzgun, Mirac, et al.
Published: (2024)

SEAL: Systematic Error Analysis for Value ALignment
by: Revel, Manon, et al.
Published: (2024)

CLEAR: Error Analysis via LLM-as-a-Judge Made Easy
by: Yehudai, Asaf, et al.
Published: (2025)

PaperBench: Evaluating AI's Ability to Replicate AI Research
by: Starace, Giulio, et al.
Published: (2025)

Leveraging LLM-Respondents for Item Evaluation: a Psychometric Analysis
by: Liu, Yunting, et al.
Published: (2024)

Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review
by: Yu, Sungduk, et al.
Published: (2025)

Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review
by: Yu, Sungduk, et al.
Published: (2024)

Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents
by: He, Muyu, et al.
Published: (2025)

SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
by: Kwon, Daeyong, et al.
Published: (2026)

ReasonOps: Operator Segmentation for LLM Reasoning Traces
by: Lee, Daniel, et al.
Published: (2026)

Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers
by: Heye, David, et al.
Published: (2026)

CEC-Zero: Chinese Error Correction Solution Based on LLM
by: Zhang, Sophie, et al.
Published: (2025)

Systematic Analysis of LLM Contributions to Planning: Solver, Verifier, Heuristic
by: Li, Haoming, et al.
Published: (2024)

A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities
by: Chen, Jiaqi, et al.
Published: (2026)

Enhancing LLM-Based Data Annotation with Error Decomposition
by: Xu, Zhen, et al.
Published: (2026)

Leveraging What's Overfixed: Post-Correction via LLM Grammatical Error Overcorrection
by: Park, Taehee, et al.
Published: (2025)

ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence
by: Wu, Kevin, et al.
Published: (2024)

Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition
by: Aerts, Diederik, et al.
Published: (2025)

MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports
by: Kyung, Sunggu, et al.
Published: (2025)

The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches
by: Abeysinghe, Bhashithe, et al.
Published: (2024)

Large Language Models Are Self-Taught Reasoners: Enhancing LLM Applications via Tailored Problem-Solving Demonstrations
by: Ong, Kai Tzu-iunn, et al.
Published: (2024)