| Main Authors: | Liu, Yang; Li, Hongming; Qin, Melissa Xiaohui; Liu, Qiankun; Huang, Chao |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.16593 |
Similar Items
Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models
by: Liu, Yang, et al.
Published: (2024)
Entropy-Based Data Selection for Language Models
by: Li, Hongming, et al.
Published: (2026)
PJB: A Reasoning-Aware Benchmark for Person-Job Retrieval
by: Wang, Guangzhi, et al.
Published: (2026)
MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models
by: Cai, Yuanqing, et al.
Published: (2026)
AstroMind: A High-Fidelity Benchmark for Spacecraft Behavior Reasoning Based on Large Language Models
by: Liu, Hao, et al.
Published: (2026)
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models
by: Zhu, Yakun, et al.
Published: (2025)
GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models
by: Yuan, Zike, et al.
Published: (2024)
Revisiting Model Interpolation for Efficient Reasoning
by: Wu, Taiqiang, et al.
Published: (2025)
EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking
by: Wei, Anjiang, et al.
Published: (2025)
TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models
by: Li, Ce, et al.
Published: (2025)
SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation
by: He, Hangfeng, et al.
Published: (2023)
LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?
by: Wang, Jingyuan, et al.
Published: (2025)
Com$^2$: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models
by: Xiong, Kai, et al.
Published: (2025)
Conceptual and Unbiased Reasoning in Language Models
by: Zhou, Ben, et al.
Published: (2024)
AbsPyramid: Benchmarking the Abstraction Ability of Language Models with a Unified Entailment Graph
by: Wang, Zhaowei, et al.
Published: (2023)
Abstraction-of-Thought Makes Language Models Better Reasoners
by: Hong, Ruixin, et al.
Published: (2024)
Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study
by: Yao, Xuan, et al.
Published: (2025)
HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
by: Wei, Yanbin, et al.
Published: (2026)
MONICA: Real-Time Monitoring and Calibration of Chain-of-Thought Sycophancy in Large Reasoning Models
by: Hu, Jingyu, et al.
Published: (2025)
Large Language Models Are Cross-Lingual Knowledge-Free Reasoners
by: Hu, Peng, et al.
Published: (2024)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning
by: Hong, Ruixin, et al.
Published: (2023)
LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
by: Gui, Jiayi, et al.
Published: (2024)
CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models
by: Zhang, Tong, et al.
Published: (2024)
Implicit Reasoning in Large Language Models: A Comprehensive Survey
by: Li, Jindong, et al.
Published: (2025)
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation
by: Zhu, Qin, et al.
Published: (2024)
BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models
by: Chen, Jiangxi, et al.
Published: (2026)
Revisiting Knowledge Distillation for Autoregressive Language Models
by: Zhong, Qihuang, et al.
Published: (2024)
ECR-Chain: Advancing Generative Language Models to Better Emotion-Cause Reasoners through Reasoning Chains
by: Huang, Zhaopei, et al.
Published: (2024)
UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
by: Xu, Xin, et al.
Published: (2025)
Revisiting the Reliability of Language Models in Instruction-Following
by: Dong, Jianshuo, et al.
Published: (2025)
Meta-Reasoning: Semantics-Symbol Deconstruction for Large Language Models
by: Wang, Yiming, et al.
Published: (2023)
Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting
by: Knappe, Tim, et al.
Published: (2024)
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
by: Guo, Zhicheng, et al.
Published: (2024)
EconLogicQA: A Question-Answering Benchmark for Evaluating Large Language Models in Economic Sequential Reasoning
by: Quan, Yinzhu, et al.
Published: (2024)
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation
by: Zeng, Zhongshen, et al.
Published: (2023)
SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models
by: Yang, Wanqi, et al.
Published: (2025)
LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment
by: Yang, Ge, et al.
Published: (2024)
GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
by: Sun, Zhouhao, et al.
Published: (2026)
Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns
by: Li, Xiang, et al.
Published: (2025)
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
by: Lin, Shi, et al.
Published: (2024)