| Main Authors: | Agarwal, Shradha; Rajbhar, Deepak; J, Tariq |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.16675 |
Similar Items
AlgBench: To What Extent Do Large Reasoning Models Understand Algorithms?
by: Sun, Henan, et al.
Published: (2026)
Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models
by: Guo, Dadi, et al.
Published: (2025)
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics
by: Liu, Junqi, et al.
Published: (2025)
LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories
by: Kang, Liwei, et al.
Published: (2026)
AlgOS: Algorithm Operating System
by: Salt, Llewyn, et al.
Published: (2025)
Riemann-Bench: A Benchmark for Moonshot Mathematics
by: Garre, Suhaas, et al.
Published: (2026)
Revealing Interpretable Failure Modes of VLMs
by: Chaudhary, Isha, et al.
Published: (2026)
ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization
by: Tso, Joseph, et al.
Published: (2026)
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
by: Feuer, Benjamin, et al.
Published: (2024)
Large Language Models and Mathematical Reasoning Failures
by: Boye, Johan, et al.
Published: (2025)
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation
by: Kim, Eunsu, et al.
Published: (2025)
UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
by: Xu, Xin, et al.
Published: (2025)
HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning
by: Jiang, Zhuohang, et al.
Published: (2025)
CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean
by: Long, Wentao, et al.
Published: (2026)
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
by: Anokhin, Petr, et al.
Published: (2025)
T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
by: Wang, Qinsi, et al.
Published: (2026)
SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes
by: Li, Kuan, et al.
Published: (2026)
Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
by: Zheng, Xiang, et al.
Published: (2026)
Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones
by: Teleki, Maria, et al.
Published: (2025)
KG-LLM-Bench: A Scalable Benchmark for Evaluating LLM Reasoning on Textualized Knowledge Graphs
by: Markowitz, Elan, et al.
Published: (2025)
MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning
by: Wang, Xukai, et al.
Published: (2025)
ProcessBench: Identifying Process Errors in Mathematical Reasoning
by: Zheng, Chujie, et al.
Published: (2024)
The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models
by: Kim, Dueun, et al.
Published: (2026)
IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch
by: Biyani, Param, et al.
Published: (2025)
MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing
by: Ma, Haoxuan, et al.
Published: (2026)
NC-Bench: An LLM Benchmark for Evaluating Conversational Competence
by: Moore, Robert J., et al.
Published: (2026)
RoMath: A Mathematical Reasoning Benchmark in Romanian
by: Cosma, Adrian, et al.
Published: (2024)
GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration
by: Zhao, Junjie, et al.
Published: (2026)
Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications
by: Vinay, Vaishali
Published: (2025)
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
by: Lee, Hyungyung, et al.
Published: (2025)
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
by: Wang, Zeyu, et al.
Published: (2026)
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
by: Glazer, Elliot, et al.
Published: (2024)
AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence
by: Lubrano, Kate M., et al.
Published: (2026)
LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing
by: Li, Hao, et al.
Published: (2026)
Real-Time Deadlines Reveal Temporal Awareness Failures in LLM Strategic Dialogues
by: Sehgal, Neil K. R., et al.
Published: (2026)
OckBench: Measuring the Efficiency of LLM Reasoning
by: Du, Zheng, et al.
Published: (2025)
TopoBench: Benchmarking LLMs on Hard Topological Reasoning
by: Maniparambil, Mayug, et al.
Published: (2026)
FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis
by: Ondras, Jan, et al.
Published: (2025)
ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams
by: Xu, Qiang, et al.
Published: (2026)
FAM-Bench: A Multimodal Benchmark for Condition-Aware Food-as-Medicine Reasoning
by: Mao, Mingyang, et al.
Published: (2026)