:: Library Catalog

Bibliographic Details
Main Authors:	Blackwell, Robert E., Barry, Jon, Cohn, Anthony G.
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2410.03492

Similar Items

Evaluating the Ability of Large Language Models to Reason about Cardinal Directions
by: Cohn, Anthony G., et al.
Published: (2024)

Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited
by: Cohn, Anthony G., et al.
Published: (2025)

Can Large Language Models Reason about the Region Connection Calculus?
by: Cohn, Anthony G., et al.
Published: (2024)

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
by: Cohn, Anthony G., et al.
Published: (2026)

Exploring Spatial Representations in the Historical Lake District Texts with LLM-based Relation Extraction
by: Haris, Erum, et al.
Published: (2024)

Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning
by: Li, Fangjun, et al.
Published: (2024)

Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark
by: Li, Fangjun, et al.
Published: (2024)

Towards A Human-in-the-Loop LLM Approach to Collaborative Discourse Analysis
by: Cohn, Clayton, et al.
Published: (2024)

An Empirical Study of Conformal Prediction in LLM with ASP Scaffolds for Robust Reasoning
by: Kaur, Navdeep, et al.
Published: (2025)

To Err Is Human; To Annotate, SILICON? Toward Robust Reproducibility in LLM Annotation
by: Cheng, Xiang, et al.
Published: (2024)

Evaluating Scoring Bias in LLM-as-a-Judge
by: Li, Qingquan, et al.
Published: (2025)

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
by: Zhao, Bingchen, et al.
Published: (2025)

Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation
by: Thellmann, Klaudia-Doris, et al.
Published: (2026)

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
by: Cao, Yixin, et al.
Published: (2025)

PredictaBoard: Benchmarking LLM Score Predictability
by: Pacchiardi, Lorenzo, et al.
Published: (2025)

Benchmark^2: Systematic Evaluation of LLM Benchmarks
by: Qian, Qi, et al.
Published: (2026)

Chasing Progress, Not Perfection: Revisiting Strategies for End-to-End LLM Plan Generation
by: Huang, Sukai, et al.
Published: (2024)

Mufu: Multilingual Fused Learning for Low-Resource Translation with LLM
by: Lim, Zheng Wei, et al.
Published: (2024)

NC-Bench: An LLM Benchmark for Evaluating Conversational Competence
by: Moore, Robert J., et al.
Published: (2026)

Tools in the Loop: Quantifying Uncertainty of LLM Question Answering Systems That Use Tools
by: Lymperopoulos, Panagiotis, et al.
Published: (2025)

The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
by: Cacioli, Jon-Paul
Published: (2026)

Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments
by: De la Iglesia, Iker, et al.
Published: (2024)

TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness
by: Zheng, Danna, et al.
Published: (2024)

IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation
by: Tran, Khanh-Tung, et al.
Published: (2025)

Quantifying the Persona Effect in LLM Simulations
by: Hu, Tiancheng, et al.
Published: (2024)

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
by: Laskar, Md Tahmid Rahman, et al.
Published: (2026)

CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring
by: Cohn, Clayton, et al.
Published: (2025)

Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation
by: Lee, Dongryeol, et al.
Published: (2024)

In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
by: Tang, Zeyu, et al.
Published: (2026)

Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals
by: Cacioli, Jon-Paul
Published: (2026)

Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
by: Maekawa, Seiji, et al.
Published: (2025)

Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
by: Perlitz, Yotam, et al.
Published: (2024)

LoraxBench: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages
by: Aji, Alham Fikri, et al.
Published: (2025)

Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition
by: Chen, Tiejin, et al.
Published: (2026)

HalluLens: LLM Hallucination Benchmark
by: Bang, Yejin, et al.
Published: (2025)

Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination
by: Agnimo, Yedidia, et al.
Published: (2026)

A Reproducibility Study of LLM-Based Query Reformulation
by: Bigdeli, Amin, et al.
Published: (2026)

Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness
by: Giordano, Luca, et al.
Published: (2025)

Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation
by: Cacioli, Jon-Paul
Published: (2026)

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
by: Atasoy, I. F., et al.
Published: (2026)