Library Catalog

Bibliographic Details
Main Authors:	Choudhury, Manan Roy; Chandramouli, Adithya; Anand, Mannan; Gupta, Vivek
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.00340

Similar Items

FinRule-Bench: A Benchmark for Joint Reasoning over Financial Tables and Principles
by: Malarkkan, Arun Vignesh, et al.
Published: (2026)

CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering
by: Zhao, Yang, et al.
Published: (2025)

DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams
by: Iyengar, Anirudh Iyengar Kaniyar Narayana, et al.
Published: (2026)

Exploiting LLMs' Reasoning Capability to Infer Implicit Concepts in Legal Information Retrieval
by: Nguyen, Hai-Long, et al.
Published: (2024)

CIRCUIT: A Benchmark for Circuit Interpretation and Reasoning Capabilities of LLMs
by: Skelic, Lejla, et al.
Published: (2025)

Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities
by: Bertsch, Amanda, et al.
Published: (2025)

Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges
by: Quan, Pengrui, et al.
Published: (2025)

Do Moral Judgment and Reasoning Capability of LLMs Change with Language? A Study using the Multilingual Defining Issues Test
by: Khandelwal, Aditi, et al.
Published: (2024)

HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning
by: Jiang, Zhuohang, et al.
Published: (2025)

Small Language Models for Agentic Systems: A Survey of Architectures, Capabilities, and Deployment Trade offs
by: Sharma, Raghav, et al.
Published: (2025)

Indian Legal NLP Benchmarks : A Survey
by: Kalamkar, Prathamesh, et al.
Published: (2021)

Auditing Sabotage Bench: A Benchmark for Detecting and Fixing Research Sabotage in ML Codebases
by: Gan, Eric, et al.
Published: (2026)

Benchmarking the Legal Reasoning of LLMs in Arabic Islamic Inheritance Cases
by: AlDahoul, Nouar, et al.
Published: (2025)

Auditing the Ethical Logic of Generative AI Models
by: Neuman, W. Russell, et al.
Published: (2025)

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation
by: Jiang, Zhaoyang, et al.
Published: (2026)

Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data
by: Liu, Xiao, et al.
Published: (2024)

CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning
by: Deason, Lauren, et al.
Published: (2025)

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
by: Gupta, Manan, et al.
Published: (2026)

Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering
by: Gupta, Manan, et al.
Published: (2026)

CHBench: A Cognitive Hierarchy Benchmark for Evaluating Strategic Reasoning Capability of LLMs
by: Liu, Hongtao, et al.
Published: (2025)

Exploring the psychology of LLMs' Moral and Legal Reasoning
by: Almeida, Guilherme F. C. F., et al.
Published: (2023)

Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean
by: Park, Chanwoo, et al.
Published: (2025)

Structured Definitions and Segmentations for Legal Reasoning in LLMs: A Study on Indian Legal Data
by: Khatri, Mann, et al.
Published: (2025)

DIAGRAMS: A Review Framework for Reasoning-Level Attribution in Diagram QA
by: Iyengar, Anirudh Iyengar Kaniyar Narayana, et al.
Published: (2026)

NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models
by: Pandya, Pranshu, et al.
Published: (2024)

Interpretable Emergent Language Using Inter-Agent Transformers
by: Bhardwaj, Mannan
Published: (2025)

CoRe: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks
by: Xie, Danning, et al.
Published: (2025)

Are Your LLMs Capable of Stable Reasoning?
by: Liu, Junnan, et al.
Published: (2024)

REDDIX-NET: A Novel Dataset and Benchmark for Moderating Online Explicit Services
by: Sathvik, MSVPJ, et al.
Published: (2025)

RESOLVE: Relational Reasoning with Symbolic and Object-Level Features Using Vector Symbolic Processing
by: Mejri, Mohamed, et al.
Published: (2024)

Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities
by: Ying, Shuangshuang, et al.
Published: (2026)

NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls
by: Basu, Kinjal, et al.
Published: (2024)

Unequal Voices: How LLMs Construct Constrained Queer Narratives
by: Ghosal, Atreya, et al.
Published: (2025)

Auditing of AI: Legal, Ethical and Technical Approaches
by: Mokander, Jakob
Published: (2024)

Dynamic Reasoning Chains through Depth-Specialized Mixture-of-Experts in Transformer Architectures
by: Roy, Sampurna, et al.
Published: (2025)

From Search to Reasoning: A Five-Level RAG Capability Framework for Enterprise Data
by: Gill, Gurbinder, et al.
Published: (2025)

Explore the Reasoning Capability of LLMs in the Chess Testbed
by: Wang, Shu, et al.
Published: (2024)

ALARB: An Arabic Legal Argument Reasoning Benchmark
by: Shairah, Harethah Abu, et al.
Published: (2025)

EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery
by: Xu, Zelin, et al.
Published: (2026)

Understanding the Geospatial Reasoning Capabilities of LLMs: A Trajectory Recovery Perspective
by: Truong, Thinh Hung, et al.
Published: (2025)