| Main Authors: | Miao, Zhongjian, Fu, Hao, Wei, Chen |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.09993 |
Similar Items
Benchmarking Real-Time Question Answering via Executable Code Workflows
by: Zhou, Wenjie, et al.
Published: (2026)
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
by: Chen, Hao-Yuan
Published: (2026)
LTLBench: Towards Benchmarks for Evaluating Temporal Reasoning in Large Language Models
by: Tang, Weizhi, et al.
Published: (2024)
Rethinking and Benchmarking Large Language Models for Graph Reasoning
by: Hu, Yuwei, et al.
Published: (2025)
Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank
by: Liu, Jiayu, et al.
Published: (2025)
Benchmarking Reasoning Robustness in Large Language Models
by: Yu, Tong, et al.
Published: (2025)
Narrative-of-Thought: Improving Temporal Reasoning of Large Language Models via Recounted Narratives
by: Zhang, Xinliang Frederick, et al.
Published: (2024)
Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering
by: Hu, Zhongjian, et al.
Published: (2024)
LongReasonArena: A Long Reasoning Benchmark for Large Language Models
by: Ding, Jiayu, et al.
Published: (2025)
SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models
by: Yang, Wanqi, et al.
Published: (2025)
Improving Arithmetic Reasoning Ability of Large Language Models through Relation Tuples, Verification and Dynamic Feedback
by: Miao, Zhongtao, et al.
Published: (2024)
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
by: Fu, Ling, et al.
Published: (2024)
AdapTime: Enabling Adaptive Temporal Reasoning in Large Language Models
by: Deng, Yimin, et al.
Published: (2026)
Hypothesis Testing Prompting Improves Deductive Reasoning in Large Language Models
by: Li, Yitian, et al.
Published: (2024)
Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework
by: Jiang, Zihao, et al.
Published: (2025)
Enhance Reasoning for Large Language Models in the Game Werewolf
by: Wu, Shuang, et al.
Published: (2024)
CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
by: Zhang, Yongheng, et al.
Published: (2025)
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models
by: Zhu, Yakun, et al.
Published: (2025)
Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models
by: Xie, Zhifei, et al.
Published: (2025)
Large Language Models-guided Dynamic Adaptation for Temporal Knowledge Graph Reasoning
by: Wang, Jiapu, et al.
Published: (2024)
Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning
by: Zhang, Zhongjian, et al.
Published: (2026)
MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science
by: Zhang, Junkai, et al.
Published: (2025)
Can Large Language Models Improve the Adversarial Robustness of Graph Neural Networks?
by: Zhang, Zhongjian, et al.
Published: (2024)
CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models
by: Li, Jingyao, et al.
Published: (2025)
Data-centric Federated Graph Learning with Large Language Models
by: Yan, Bo, et al.
Published: (2025)
Spatial Reasoning in Multimodal Large Language Models: A Survey of Tasks, Benchmarks and Methods
by: Liu, Weichen, et al.
Published: (2025)
What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?
by: Bhatia, Gagan, et al.
Published: (2026)
EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering and Reasoning
by: Wei, Mingyang, et al.
Published: (2026)
Oedipus and the Sphinx: Benchmarking and Improving Visual Language Models for Complex Graphic Reasoning
by: Zhang, Jianyi, et al.
Published: (2025)
A Survey of Scaling in Large Language Model Reasoning
by: Chen, Zihan, et al.
Published: (2025)
Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games
by: Fan, Mingyuan, et al.
Published: (2026)
A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models
by: Chang, Ching, et al.
Published: (2025)
TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models
by: Chu, Zheng, et al.
Published: (2023)
Position: Theory of Mind Benchmarks are Broken for Large Language Models
by: Riemer, Matthew, et al.
Published: (2024)
Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study
by: Yao, Xuan, et al.
Published: (2025)
Large Language Models for Classical Chinese Poetry Translation: Benchmarking, Evaluating, and Improving
by: Chen, Andong, et al.
Published: (2024)
DateLogicQA: Benchmarking Temporal Biases in Large Language Models
by: Bhatia, Gagan, et al.
Published: (2024)
HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning
by: Yang, Qihao, et al.
Published: (2025)
Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring
by: Guan, Weixin, et al.
Published: (2026)
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
by: Li, Pengfeng, et al.
Published: (2026)