| Main Authors: | Blackwell, Robert E., Barry, Jon, Cohn, Anthony G. |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.03492 |
Similar Items
Evaluating the Ability of Large Language Models to Reason about Cardinal Directions
by: Cohn, Anthony G, et al.
Published: (2024)
Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited
by: Cohn, Anthony G, et al.
Published: (2025)
Can Large Language Models Reason about the Region Connection Calculus?
by: Cohn, Anthony G, et al.
Published: (2024)
QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
by: Cohn, Anthony G., et al.
Published: (2026)
Exploring Spatial Representations in the Historical Lake District Texts with LLM-based Relation Extraction
by: Haris, Erum, et al.
Published: (2024)
Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning
by: Li, Fangjun, et al.
Published: (2024)
Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation and Enhancement Using the StepGame Benchmark
by: Li, Fangjun, et al.
Published: (2024)
Towards A Human-in-the-Loop LLM Approach to Collaborative Discourse Analysis
by: Cohn, Clayton, et al.
Published: (2024)
An Empirical Study of Conformal Prediction in LLM with ASP Scaffolds for Robust Reasoning
by: Kaur, Navdeep, et al.
Published: (2025)
To Err Is Human; To Annotate, SILICON? Toward Robust Reproducibility in LLM Annotation
by: Cheng, Xiang, et al.
Published: (2024)
Evaluating Scoring Bias in LLM-as-a-Judge
by: Li, Qingquan, et al.
Published: (2025)
The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
by: Zhao, Bingchen, et al.
Published: (2025)
Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation
by: Thellmann, Klaudia-Doris, et al.
Published: (2026)
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
by: Cao, Yixin, et al.
Published: (2025)
PredictaBoard: Benchmarking LLM Score Predictability
by: Pacchiardi, Lorenzo, et al.
Published: (2025)
Benchmark^2: Systematic Evaluation of LLM Benchmarks
by: Qian, Qi, et al.
Published: (2026)
Chasing Progress, Not Perfection: Revisiting Strategies for End-to-End LLM Plan Generation
by: Huang, Sukai, et al.
Published: (2024)
Mufu: Multilingual Fused Learning for Low-Resource Translation with LLM
by: Lim, Zheng Wei, et al.
Published: (2024)
NC-Bench: An LLM Benchmark for Evaluating Conversational Competence
by: Moore, Robert J., et al.
Published: (2026)
Tools in the Loop: Quantifying Uncertainty of LLM Question Answering Systems That Use Tools
by: Lymperopoulos, Panagiotis, et al.
Published: (2025)
The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
by: Cacioli, Jon-Paul
Published: (2026)
Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments
by: De la Iglesia, Iker, et al.
Published: (2024)
TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness
by: Zheng, Danna, et al.
Published: (2024)
IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation
by: Tran, Khanh-Tung, et al.
Published: (2025)
Quantifying the Persona Effect in LLM Simulations
by: Hu, Tiancheng, et al.
Published: (2024)
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
by: Laskar, Md Tahmid Rahman, et al.
Published: (2026)
CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring
by: Cohn, Clayton, et al.
Published: (2025)
Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation
by: Lee, Dongryeol, et al.
Published: (2024)
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
by: Tang, Zeyu, et al.
Published: (2026)
Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals
by: Cacioli, Jon-Paul
Published: (2026)
Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
by: Maekawa, Seiji, et al.
Published: (2025)
Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
by: Perlitz, Yotam, et al.
Published: (2024)
LoraxBench: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages
by: Aji, Alham Fikri, et al.
Published: (2025)
Every Response Counts: Quantifying Uncertainty of LLM-based Multi-Agent Systems through Tensor Decomposition
by: Chen, Tiejin, et al.
Published: (2026)
HalluLens: LLM Hallucination Benchmark
by: Bang, Yejin, et al.
Published: (2025)
Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination
by: Agnimo, Yedidia, et al.
Published: (2026)
A Reproducibility Study of LLM-Based Query Reformulation
by: Bigdeli, Amin, et al.
Published: (2026)
Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness
by: Giordano, Luca, et al.
Published: (2025)
Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation
by: Cacioli, Jon-Paul
Published: (2026)
Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
by: Atasoy, I. F., et al.
Published: (2026)