| Main Authors: | Bianchi, Federico, Kwon, Yongchan, Izzo, Zachary, Zhang, Linjun, Zou, James |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.05925 |
Similar Items
Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
by: Bianchi, Federico, et al.
Published: (2024)
Can AI Be as Creative as Humans?
by: Wang, Haonan, et al.
Published: (2024)
Voice "Cloning" is Style Transfer
by: Zhou, Kaitlyn, et al.
Published: (2026)
ReasonIF: Large Reasoning Models Fail to Follow Instructions During Reasoning
by: Kwon, Yongchan, et al.
Published: (2025)
Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers
by: Kleidermacher, Hannah Calzi, et al.
Published: (2025)
"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
by: Zhou, Kaitlyn, et al.
Published: (2026)
What LLMs Think When You Don't Tell Them What to Think About?
by: Kwon, Yongchan, et al.
Published: (2026)
The Role of Ambiguity in Error Prediction via Uncertainty Quantification
by: Staliūnaitė, Ieva Raminta, et al.
Published: (2026)
Automated Benchmark Auditing for AI Agents and Large Language Models
by: Wang, Junlin, et al.
Published: (2026)
Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents
by: Miao, Jiacheng, et al.
Published: (2025)
Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
by: Kim, Dongjun, et al.
Published: (2025)
Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models
by: Liu, Yi, et al.
Published: (2025)
How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis
by: Bianchi, Federico, et al.
Published: (2024)
TextGrad: Automatic "Differentiation" via Text
by: Yuksekgonul, Mert, et al.
Published: (2024)
FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees
by: Nie, Fan, et al.
Published: (2024)
Understanding Impact of Human Feedback via Influence Functions
by: Min, Taywon, et al.
Published: (2025)
Spot the BlindSpots: Systematic Identification and Quantification of Fine-Grained LLM Biases in Contact Center Summaries
by: Mayilvaghanan, Kawin, et al.
Published: (2025)
Can LLMs Truly Embody Human Personality? Analyzing AI and Human Behavior Alignment in Dispute Resolution
by: Kwon, Deuksin, et al.
Published: (2026)
Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment
by: Kwon, Jea, et al.
Published: (2025)
Belief in the Machine: Investigating Epistemological Blind Spots of Language Models
by: Suzgun, Mirac, et al.
Published: (2024)
SEAL: Systematic Error Analysis for Value ALignment
by: Revel, Manon, et al.
Published: (2024)
CLEAR: Error Analysis via LLM-as-a-Judge Made Easy
by: Yehudai, Asaf, et al.
Published: (2025)
PaperBench: Evaluating AI's Ability to Replicate AI Research
by: Starace, Giulio, et al.
Published: (2025)
Leveraging LLM-Respondents for Item Evaluation: a Psychometric Analysis
by: Liu, Yunting, et al.
Published: (2024)
Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review
by: Yu, Sungduk, et al.
Published: (2025)
Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review
by: Yu, Sungduk, et al.
Published: (2024)
Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents
by: He, Muyu, et al.
Published: (2025)
SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
by: Kwon, Daeyong, et al.
Published: (2026)
ReasonOps: Operator Segmentation for LLM Reasoning Traces
by: Lee, Daniel, et al.
Published: (2026)
Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers
by: Heye, David, et al.
Published: (2026)
CEC-Zero: Chinese Error Correction Solution Based on LLM
by: Zhang, Sophie, et al.
Published: (2025)
Systematic Analysis of LLM Contributions to Planning: Solver, Verifier, Heuristic
by: Li, Haoming, et al.
Published: (2024)
A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities
by: Chen, Jiaqi, et al.
Published: (2026)
Enhancing LLM-Based Data Annotation with Error Decomposition
by: Xu, Zhen, et al.
Published: (2026)
Leveraging What's Overfixed: Post-Correction via LLM Grammatical Error Overcorrection
by: Park, Taehee, et al.
Published: (2025)
ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence
by: Wu, Kevin, et al.
Published: (2024)
Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition
by: Aerts, Diederik, et al.
Published: (2025)
MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports
by: Kyung, Sunggu, et al.
Published: (2025)
The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches
by: Abeysinghe, Bhashithe, et al.
Published: (2024)
Large Language Models Are Self-Taught Reasoners: Enhancing LLM Applications via Tailored Problem-Solving Demonstrations
by: Ong, Kai Tzu-iunn, et al.
Published: (2024)