| Main Authors: | Zhu, Kexin, Han, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2507.03477 |
Similar Items
MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science
by: Zhang, Junkai, et al.
Published: (2025)
FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models
by: Liu, Shu, et al.
Published: (2024)
RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models
by: Wang, Zekun Moore, et al.
Published: (2023)
LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
by: Gui, Jiayi, et al.
Published: (2024)
Eliciting Causal Abilities in Large Language Models for Reasoning Tasks
by: Wang, Yajing, et al.
Published: (2024)
KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models
by: Kim, Dongjun, et al.
Published: (2025)
HouseTS: A Large-Scale, Multimodal Spatiotemporal U.S. Housing Dataset and Benchmark
by: Wang, Shengkun, et al.
Published: (2025)
Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models
by: Li, Haoyang, et al.
Published: (2025)
MANGO: A Benchmark for Evaluating Mapping and Navigation Abilities of Large Language Models
by: Ding, Peng, et al.
Published: (2024)
Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities
by: Purpura, Alberto, et al.
Published: (2026)
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
by: Hu, Tiancheng, et al.
Published: (2025)
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
by: Kang, Zhaolu, et al.
Published: (2025)
Large Language Models have Intrinsic Self-Correction Ability
by: Liu, Dancheng, et al.
Published: (2024)
Collaboration among Multiple Large Language Models for Medical Question Answering
by: Shang, Kexin, et al.
Published: (2025)
Neuro-Symbolic Artificial Intelligence: Towards Improving the Reasoning Abilities of Large Language Models
by: Yang, Xiao-Wen, et al.
Published: (2025)
Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models
by: Xu, Derong, et al.
Published: (2024)
Counting Ability of Large Language Models and Impact of Tokenization
by: Zhang, Xiang, et al.
Published: (2024)
ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents
by: Kang, Hao, et al.
Published: (2024)
Deception Abilities Emerged in Large Language Models
by: Hagendorff, Thilo
Published: (2023)
Benchmarking Multi-National Value Alignment for Large Language Models
by: Shi, Weijie, et al.
Published: (2025)
TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models
by: Li, Ce, et al.
Published: (2025)
ORACLE: Optimizing Reasoning Abilities of Large Language Models via Constraint-Led Synthetic Data Elicitation
by: Yang, Zhuojie, et al.
Published: (2026)
ScholarGym: Benchmarking Large Language Model Capabilities in the Information-Gathering Stage of Deep Research
by: Shen, Hao, et al.
Published: (2026)
Investigating Large Language Models' Linguistic Abilities for Text Preprocessing
by: Braga, Marco, et al.
Published: (2025)
Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions
by: Pan, Zhuoran, et al.
Published: (2026)
Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization
by: Liu, Jiacai, et al.
Published: (2024)
QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
by: Puyin, Li, et al.
Published: (2025)
GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps
by: Nasir, Muhammad Umair, et al.
Published: (2024)
Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities
by: Rahman, Roussel
Published: (2025)
NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes
by: Fan, Lizhou, et al.
Published: (2023)
Self-Evolving Critique Abilities in Large Language Models
by: Tang, Zhengyang, et al.
Published: (2025)
Emergent Abilities in Large Language Models: A Survey
by: Berti, Leonardo, et al.
Published: (2025)
Medical Large Vision Language Models with Multi-Image Visual Ability
by: Yang, Xikai, et al.
Published: (2025)
Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation
by: Elhady, Ahmed, et al.
Published: (2025)
Psychological Counseling Ability of Large Language Models
by: Peng, Fangyu, et al.
Published: (2025)
Benchmarking Gender and Political Bias in Large Language Models
by: Yang, Jinrui, et al.
Published: (2025)
Belief-Guided Inference Control for Large Language Model Services via Verifiable Observations
by: Yuan, Wenhao, et al.
Published: (2026)
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
by: Garg, Divyansh, et al.
Published: (2025)
M3GIA: A Cognition Inspired Multilingual and Multimodal General Intelligence Ability Benchmark
by: Song, Wei, et al.
Published: (2024)
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
by: Xu, Jingxuan, et al.
Published: (2025)