| Main Authors: | Sun, Weiwei, Feng, Shengyu, Li, Shanda, Yang, Yiming |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.04310 |
Similar Items
CodePDE: An Inference Framework for LLM-driven PDE Solver Generation
by: Li, Shanda, et al.
Published: (2025)
FrontierCO: Real-World and Large-Scale Evaluation of Machine Learning Solvers for Combinatorial Optimization
by: Feng, Shengyu, et al.
Published: (2025)
OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology
by: Zhou, Chengfeng, et al.
Published: (2025)
CASE-Bench: Context-Aware SafEty Benchmark for Large Language Models
by: Sun, Guangzhi, et al.
Published: (2025)
CoMind: Towards Community-Driven Agents for Machine Learning Engineering
by: Li, Sijie, et al.
Published: (2025)
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
by: Deng, Shihan, et al.
Published: (2024)
CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models
by: LI, Yizhi, et al.
Published: (2024)
GraphicBench: A Planning Benchmark for Graphic Design with Language Agents
by: Ki, Dayeon, et al.
Published: (2025)
VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
by: Xiaohongshu Inc
Published: (2026)
MPCI-Bench: A Benchmark for Multimodal Pairwise Contextual Integrity Evaluation of Language Model Agents
by: Wang, Shouju, et al.
Published: (2026)
MASLegalBench: Benchmarking Multi-Agent Systems in Deductive Legal Reasoning
by: Jing, Huihao, et al.
Published: (2025)
Benchmarking Randomized Optimization Algorithms on Binary, Permutation, and Combinatorial Problem Landscapes
by: Odeyemi, Jethro, et al.
Published: (2025)
UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models
by: Zheng, Yu, et al.
Published: (2025)
TaskBench: Benchmarking Large Language Models for Task Automation
by: Shen, Yongliang, et al.
Published: (2023)
OR-Bench: An Over-Refusal Benchmark for Large Language Models
by: Cui, Justin, et al.
Published: (2024)
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
by: Tang, Xiangru, et al.
Published: (2025)
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
by: Yan, Yuchen, et al.
Published: (2025)
AlignBench: Benchmarking Chinese Alignment of Large Language Models
by: Liu, Xiao, et al.
Published: (2023)
BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting
by: Wang, Zhensheng, et al.
Published: (2026)
SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth
by: Xing, Wenpeng, et al.
Published: (2025)
MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models
by: Tang, Zecheng, et al.
Published: (2026)
FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol
by: Zhu, Jie, et al.
Published: (2026)
FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
by: Lee, Gyubok, et al.
Published: (2025)
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
by: Yang, Rui, et al.
Published: (2025)
VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
by: He, Wei, et al.
Published: (2025)
CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models
by: Zhang, Wenjing, et al.
Published: (2024)
World-Model-Augmented Web Agents with Action Correction
by: Shen, Zhouzhou, et al.
Published: (2026)
DataSciBench: An LLM Agent Benchmark for Data Science
by: Zhang, Dan, et al.
Published: (2025)
JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models
by: Geng, Saibo, et al.
Published: (2025)
TurkBench: A Benchmark for Evaluating Turkish Large Language Models
by: Toraman, Çağrı, et al.
Published: (2026)
DarkBench: Benchmarking Dark Patterns in Large Language Models
by: Kran, Esben, et al.
Published: (2025)
RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
by: Yang, Shuo, et al.
Published: (2025)
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
by: Adelani, David Ifeoluwa, et al.
Published: (2024)
DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
by: Huang, Yiming, et al.
Published: (2024)
Optimizing Temperature for Language Models with Multi-Sample Inference
by: Du, Weihua, et al.
Published: (2025)
Self-Play Preference Optimization for Language Model Alignment
by: Wu, Yue, et al.
Published: (2024)
KLoB: a Benchmark for Assessing Knowledge Locating Methods in Language Models
by: Ju, Yiming, et al.
Published: (2023)
SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks
by: Cao, Hongye, et al.
Published: (2025)
Training Proactive and Personalized LLM Agents
by: Sun, Weiwei, et al.
Published: (2025)
Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code
by: Jiang, Nan, et al.
Published: (2024)