| Main Authors: | Nagarkar, Crish, Bogachev, Leonid, Sharoff, Serge |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.14479 |
Similar Items
How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
by: Khallaf, Nouran, et al.
Published: (2026)
To Predict or Not to Predict? Towards reliable uncertainty estimation in the presence of noise
by: Khallaf, Nouran, et al.
Published: (2026)
Reading Between the Lines: A dataset and a study on why some texts are tougher than others
by: Khallaf, Nouran, et al.
Published: (2025)
Controlling Out-of-Domain Gaps in LLMs for Genre Classification and Generated Text Detection
by: Roussinov, Dmitri, et al.
Published: (2024)
Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification
by: Hilasaca, Kenji, et al.
Published: (2026)
Almost Clinical: Linguistic properties of synthetic electronic health records
by: Sharoff, Serge, et al.
Published: (2026)
Can LLMs Reason About Trust?: A Pilot Study
by: Debnath, Anushka, et al.
Published: (2025)
Can We Trust LLM Detectors?
by: Sandhan, Jivnesh, et al.
Published: (2026)
Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
by: Schroeder, Kayla, et al.
Published: (2024)
More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning
by: Shafiei, Mohammadamin, et al.
Published: (2025)
When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
by: Badawi, Abeer, et al.
Published: (2025)
Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation
by: Karakaş, Sercan, et al.
Published: (2026)
LLM-REVal: Can We Trust LLM Reviewers Yet?
by: Li, Rui, et al.
Published: (2025)
Trust Modeling in Counseling Conversations: A Benchmark Study
by: Srivastava, Aseem, et al.
Published: (2025)
LLM for Complex Reasoning Task: An Exploratory Study in Fermi Problems
by: Liu, Zishuo, et al.
Published: (2025)
Using Contrastive Learning to Improve Two-Way Reasoning in Large Language Models: The Obfuscation Task as a Case Study
by: Nikiema, Serge Lionel, et al.
Published: (2025)
Human or LLM as Standardized Patients? A Comparative Study for Medical Education
by: Zhang, Bingquan, et al.
Published: (2025)
Can Small Models Reason About Legal Documents? A Comparative Study
by: Vaddi, Snehit
Published: (2026)
Navigating Rifts in Human-LLM Grounding: Study and Benchmark
by: Shaikh, Omar, et al.
Published: (2025)
When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment
by: Ferrer, Robinson, et al.
Published: (2026)
Assessing Gender Bias in LLMs: Comparing LLM Outputs with Human Perceptions and Official Statistics
by: Bas, Tetiana
Published: (2024)
Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
by: Qiu, Mengyang, et al.
Published: (2025)
Characterizing Knowledge Graph Tasks in LLM Benchmarks Using Cognitive Complexity Frameworks
by: Todorovikj, Sara, et al.
Published: (2025)
LLM or Human? Perceptions of Trust and Information Quality in Research Summaries
by: Akpinar, Nil-Jana, et al.
Published: (2026)
Can We Trust LLMs on Memristors? Diving into Reasoning Ability under Non-Ideality
by: Wu, Taiqiang, et al.
Published: (2026)
Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations
by: Ebubechukwu, Ike, et al.
Published: (2024)
ATR-Bench: A Federated Learning Benchmark for Adaptation, Trust, and Reasoning
by: Ashraf, Tajamul, et al.
Published: (2025)
Can Large Language Model Agents Simulate Human Trust Behavior?
by: Xie, Chengxing, et al.
Published: (2024)
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
by: Wang, Leyao, et al.
Published: (2026)
Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild
by: Mireshghallah, Niloofar, et al.
Published: (2024)
Irony in Emojis: A Comparative Study of Human and LLM Interpretation
by: Zheng, Yawen, et al.
Published: (2025)
Semi-structured LLM Reasoners Can Be Rigorously Audited
by: Leng, Jixuan, et al.
Published: (2025)
LLM Output Detectability and Task Performance Can be Jointly Optimized
by: Saito, Koshiro, et al.
Published: (2026)
ER-Reason: A Benchmark Dataset for LLM Clinical Reasoning in the Emergency Room
by: Mehandru, Nikita, et al.
Published: (2025)
Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
by: Prandi, Matteo, et al.
Published: (2025)
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
by: Jung, Jaehun, et al.
Published: (2024)
Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks
by: Ganguly, Debargha, et al.
Published: (2025)
A LLM Benchmark based on the Minecraft Builder Dialog Agent Task
by: Madge, Chris, et al.
Published: (2024)
Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization
by: Zhou, Jin Peng, et al.
Published: (2024)
ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts
by: Su, Ruiran, et al.
Published: (2025)