Library Catalog

Bibliographic Details
Main Authors:	Nagarkar, Crish, Bogachev, Leonid, Sharoff, Serge
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.14479

Similar Items

How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
by: Khallaf, Nouran, et al.
Published: (2026)

To Predict or Not to Predict? Towards reliable uncertainty estimation in the presence of noise
by: Khallaf, Nouran, et al.
Published: (2026)

Reading Between the Lines: A dataset and a study on why some texts are tougher than others
by: Khallaf, Nouran, et al.
Published: (2025)

Controlling Out-of-Domain Gaps in LLMs for Genre Classification and Generated Text Detection
by: Roussinov, Dmitri, et al.
Published: (2024)

Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification
by: Hilasaca, Kenji, et al.
Published: (2026)

Almost Clinical: Linguistic properties of synthetic electronic health records
by: Sharoff, Serge, et al.
Published: (2026)

Can LLMs Reason About Trust?: A Pilot Study
by: Debnath, Anushka, et al.
Published: (2025)

Can We Trust LLM Detectors?
by: Sandhan, Jivnesh, et al.
Published: (2026)

Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
by: Schroeder, Kayla, et al.
Published: (2024)

More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning
by: Shafiei, Mohammadamin, et al.
Published: (2025)

When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
by: Badawi, Abeer, et al.
Published: (2025)

Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation
by: Karakaş, Sercan, et al.
Published: (2026)

LLM-REVal: Can We Trust LLM Reviewers Yet?
by: Li, Rui, et al.
Published: (2025)

Trust Modeling in Counseling Conversations: A Benchmark Study
by: Srivastava, Aseem, et al.
Published: (2025)

LLM for Complex Reasoning Task: An Exploratory Study in Fermi Problems
by: Liu, Zishuo, et al.
Published: (2025)

Using Contrastive Learning to Improve Two-Way Reasoning in Large Language Models: The Obfuscation Task as a Case Study
by: Nikiema, Serge Lionel, et al.
Published: (2025)

Human or LLM as Standardized Patients? A Comparative Study for Medical Education
by: Zhang, Bingquan, et al.
Published: (2025)

Can Small Models Reason About Legal Documents? A Comparative Study
by: Vaddi, Snehit
Published: (2026)

Navigating Rifts in Human-LLM Grounding: Study and Benchmark
by: Shaikh, Omar, et al.
Published: (2025)

When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment
by: Ferrer, Robinson, et al.
Published: (2026)

Assessing Gender Bias in LLMs: Comparing LLM Outputs with Human Perceptions and Official Statistics
by: Bas, Tetiana
Published: (2024)

Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
by: Qiu, Mengyang, et al.
Published: (2025)

Characterizing Knowledge Graph Tasks in LLM Benchmarks Using Cognitive Complexity Frameworks
by: Todorovikj, Sara, et al.
Published: (2025)

LLM or Human? Perceptions of Trust and Information Quality in Research Summaries
by: Akpinar, Nil-Jana, et al.
Published: (2026)

Can We Trust LLMs on Memristors? Diving into Reasoning Ability under Non-Ideality
by: Wu, Taiqiang, et al.
Published: (2026)

Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations
by: Ebubechukwu, Ike, et al.
Published: (2024)

ATR-Bench: A Federated Learning Benchmark for Adaptation, Trust, and Reasoning
by: Ashraf, Tajamul, et al.
Published: (2025)

Can Large Language Model Agents Simulate Human Trust Behavior?
by: Xie, Chengxing, et al.
Published: (2024)

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
by: Wang, Leyao, et al.
Published: (2026)

Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild
by: Mireshghallah, Niloofar, et al.
Published: (2024)

Irony in Emojis: A Comparative Study of Human and LLM Interpretation
by: Zheng, Yawen, et al.
Published: (2025)

Semi-structured LLM Reasoners Can Be Rigorously Audited
by: Leng, Jixuan, et al.
Published: (2025)

LLM Output Detectability and Task Performance Can be Jointly Optimized
by: Saito, Koshiro, et al.
Published: (2026)

ER-Reason: A Benchmark Dataset for LLM Clinical Reasoning in the Emergency Room
by: Mehandru, Nikita, et al.
Published: (2025)

Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
by: Prandi, Matteo, et al.
Published: (2025)

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
by: Jung, Jaehun, et al.
Published: (2024)

Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks
by: Ganguly, Debargha, et al.
Published: (2025)

A LLM Benchmark based on the Minecraft Builder Dialog Agent Task
by: Madge, Chris, et al.
Published: (2024)

Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization
by: Zhou, Jin Peng, et al.
Published: (2024)

ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts
by: Su, Ruiran, et al.
Published: (2025)