Library Catalog

Bibliographic Details
Main Authors:	Agarwal, Shradha; Rajbhar, Deepak; J, Tariq
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.16675

Similar Items

AlgBench: To What Extent Do Large Reasoning Models Understand Algorithms?
by: Sun, Henan, et al.
Published: (2026)

Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models
by: Guo, Dadi, et al.
Published: (2025)

CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics
by: Liu, Junqi, et al.
Published: (2025)

LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories
by: Kang, Liwei, et al.
Published: (2026)

AlgOS: Algorithm Operating System
by: Salt, Llewyn, et al.
Published: (2025)

Riemann-Bench: A Benchmark for Moonshot Mathematics
by: Garre, Suhaas, et al.
Published: (2026)

Revealing Interpretable Failure Modes of VLMs
by: Chaudhary, Isha, et al.
Published: (2026)

ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization
by: Tso, Joseph, et al.
Published: (2026)

Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
by: Feuer, Benjamin, et al.
Published: (2024)

Large Language Models and Mathematical Reasoning Failures
by: Boye, Johan, et al.
Published: (2025)

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation
by: Kim, Eunsu, et al.
Published: (2025)

UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
by: Xu, Xin, et al.
Published: (2025)

HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning
by: Jiang, Zhuohang, et al.
Published: (2025)

CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean
by: Long, Wentao, et al.
Published: (2026)

HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
by: Anokhin, Petr, et al.
Published: (2025)

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
by: Wang, Qinsi, et al.
Published: (2026)

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes
by: Li, Kuan, et al.
Published: (2026)

Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
by: Zheng, Xiang, et al.
Published: (2026)

Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones
by: Teleki, Maria, et al.
Published: (2025)

KG-LLM-Bench: A Scalable Benchmark for Evaluating LLM Reasoning on Textualized Knowledge Graphs
by: Markowitz, Elan, et al.
Published: (2025)

MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning
by: Wang, Xukai, et al.
Published: (2025)

ProcessBench: Identifying Process Errors in Mathematical Reasoning
by: Zheng, Chujie, et al.
Published: (2024)

The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models
by: Kim, Dueun, et al.
Published: (2026)

IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch
by: Biyani, Param, et al.
Published: (2025)

MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing
by: Ma, Haoxuan, et al.
Published: (2026)

NC-Bench: An LLM Benchmark for Evaluating Conversational Competence
by: Moore, Robert J., et al.
Published: (2026)

RoMath: A Mathematical Reasoning Benchmark in Romanian
by: Cosma, Adrian, et al.
Published: (2024)

GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration
by: Zhao, Junjie, et al.
Published: (2026)

Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications
by: Vinay, Vaishali
Published: (2025)

CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
by: Lee, Hyungyung, et al.
Published: (2025)

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
by: Wang, Zeyu, et al.
Published: (2026)

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
by: Glazer, Elliot, et al.
Published: (2024)

AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence
by: Lubrano, Kate M., et al.
Published: (2026)

LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing
by: Li, Hao, et al.
Published: (2026)

Real-Time Deadlines Reveal Temporal Awareness Failures in LLM Strategic Dialogues
by: Sehgal, Neil K. R., et al.
Published: (2026)

OckBench: Measuring the Efficiency of LLM Reasoning
by: Du, Zheng, et al.
Published: (2025)

TopoBench: Benchmarking LLMs on Hard Topological Reasoning
by: Maniparambil, Mayug, et al.
Published: (2026)

FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis
by: Ondras, Jan, et al.
Published: (2025)

ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams
by: Xu, Qiang, et al.
Published: (2026)

FAM-Bench: A Multimodal Benchmark for Condition-Aware Food-as-Medicine Reasoning
by: Mao, Mingyang, et al.
Published: (2026)