:: Library Catalog

Bibliographic Details
Main Authors:	Yang, Eddie; Wang, Dashun
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2602.11898

Similar Items

A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs
by: Allen, Bradley P., et al.
Published: (2025)

The Illusion of Stochasticity in LLMs
by: Gu, Xiangming, et al.
Published: (2026)

Sci2Pol: Evaluating and Fine-tuning LLMs on Scientific-to-Policy Brief Generation
by: Wu, Weimin, et al.
Published: (2025)

Papilusion at DAGPap24: Paper or Illusion? Detecting AI-generated Scientific Papers
by: Andreev, Nikita, et al.
Published: (2024)

Is LLM an Overconfident Judge? Unveiling the Capabilities of LLMs in Detecting Offensive Language with Annotation Disagreement
by: Lu, Junyu, et al.
Published: (2025)

Quantifying the Benefit of Artificial Intelligence for Scientific Research
by: Gao, Jian, et al.
Published: (2023)

When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements
by: Ju, Tianjie, et al.
Published: (2025)

LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
by: Zhou, Yujun, et al.
Published: (2024)

Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions
by: Rostamkhani, Mohammadmostafa, et al.
Published: (2024)

The Illusion of Certainty: Uncertainty Quantification for LLMs Fails under Ambiguity
by: Tomov, Tim, et al.
Published: (2025)

The Illusion-Illusion: Vision Language Models See Illusions Where There are None
by: Ullman, Tomer
Published: (2024)

The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
by: Janiak, Denis, et al.
Published: (2025)

SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs
by: Chen, Haotian, et al.
Published: (2025)

Leveraging Annotator Disagreement for Text Classification
by: Xu, Jin, et al.
Published: (2024)

Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis
by: Lu, Junyu, et al.
Published: (2026)

SciDA: Scientific Dynamic Assessor of LLMs
by: Zhou, Junting, et al.
Published: (2025)

Pun Unintended: LLMs and the Illusion of Humor Understanding
by: Zangari, Alessandro, et al.
Published: (2025)

ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
by: Liu, Yujie, et al.
Published: (2025)

EarthSE: A Benchmark for Evaluating Earth Scientific Exploration Capability of LLMs
by: Xu, Wanghan, et al.
Published: (2025)

Quantifying and Predicting Disagreement in Graded Human Ratings
by: Zhang, Leixin, et al.
Published: (2026)

The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs
by: Han, Pengrui, et al.
Published: (2025)

MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
by: Tian, Yang, et al.
Published: (2025)

MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors
by: Hikal, Baraa, et al.
Published: (2025)

IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models
by: Shahgir, Haz Sameen, et al.
Published: (2024)

The Leaderboard Illusion
by: Singh, Shivalika, et al.
Published: (2025)

NUTMEG: Separating Signal From Noise in Annotator Disagreement
by: Ivey, Jonathan, et al.
Published: (2025)

Do Differences in Values Influence Disagreements in Online Discussions?
by: van der Meer, Michiel, et al.
Published: (2023)

Bridging the Gap: In-Context Learning for Modeling Human Disagreement
by: Muscato, Benedetta, et al.
Published: (2025)

From Disagreement to Understanding: The Case for Ambiguity Detection in NLI
by: Jayaweera, Chathuri, et al.
Published: (2025)

The Illusion of State in State-Space Models
by: Merrill, William, et al.
Published: (2024)

LEGOBench: Scientific Leaderboard Generation Benchmark
by: Singh, Shruti, et al.
Published: (2024)

SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis
by: Cai, Hengxing, et al.
Published: (2024)

The Gray Area: Characterizing Moderator Disagreement on Reddit
by: Alipour, Shayan, et al.
Published: (2026)

Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions
by: Wang, Dingzirui, et al.
Published: (2025)

Small Changes, Large Consequences: Analyzing the Allocational Fairness of LLMs in Hiring Contexts
by: Seshadri, Preethi, et al.
Published: (2025)

Extreme Miscalibration and the Illusion of Adversarial Robustness
by: Raina, Vyas, et al.
Published: (2024)

Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning
by: Javaji, Shashidhar Reddy, et al.
Published: (2025)

Benchmarking LLMs via Uncertainty Quantification
by: Ye, Fanghua, et al.
Published: (2024)

Beyond Consensus: Perspectivist Modeling and Evaluation of Annotator Disagreement in NLP
by: Xu, Yinuo, et al.
Published: (2026)

Disagreement as Data: Reasoning Trace Analytics in Multi-Agent Systems
by: Tajik, Elham, et al.
Published: (2026)