| Main Authors: | Backlund, Axel, Petersson, Lukas |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.15840 |
Similar Items
Blueprint-Bench: Comparing spatial intelligence of LLMs, agents and image models
by: Petersson, Lukas, et al.
Published: (2025)
Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence
by: Sharrock, Callum, et al.
Published: (2025)
Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents
by: Shen, Yiting, et al.
Published: (2026)
YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
by: He, Muyu, et al.
Published: (2026)
VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents
by: Chen, Yuhao, et al.
Published: (2026)
StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns
by: Wan, Luanbo, et al.
Published: (2025)
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
by: Li, Keyu, et al.
Published: (2026)
TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios
by: Shen, Yuanzhe, et al.
Published: (2026)
LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
by: Qiu, Jielin, et al.
Published: (2025)
VitalBench: A Rigorous Multi-Center Benchmark for Long-Term Vital Sign Prediction in Intraoperative Care
by: Cai, Xiuding, et al.
Published: (2025)
DCA-Bench: A Benchmark for Dataset Curation Agents
by: Huang, Benhao, et al.
Published: (2024)
MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models
by: Tang, Zecheng, et al.
Published: (2026)
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
by: Dong, Haonan, et al.
Published: (2026)
LongGenBench: Long-context Generation Benchmark
by: Liu, Xiang, et al.
Published: (2024)
Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents
by: Bei, Yuanchen, et al.
Published: (2026)
PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
by: Nie, Hongyi, et al.
Published: (2026)
LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
by: Cheng, Zihao, et al.
Published: (2026)
KellyBench: A Benchmark for Long-Horizon Sequential Decision Making
by: Grady, Thomas, et al.
Published: (2026)
RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments
by: Zhang, Linghua, et al.
Published: (2026)
ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
by: Song, Yuanyi, et al.
Published: (2025)
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
by: Levy, Ido, et al.
Published: (2024)
ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support
by: Chen, Tiantian, et al.
Published: (2026)
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
by: Chen, Jingxuan, et al.
Published: (2024)
ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
by: Imajuku, Yuki, et al.
Published: (2025)
ProBench: Benchmarking GUI Agents with Accurate Process Information
by: Yang, Leyang, et al.
Published: (2025)
SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval
by: Li, Ningyuan, et al.
Published: (2026)
LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios
by: Chen, Tianyu, et al.
Published: (2026)
GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis
by: Yu, Bo, et al.
Published: (2026)
PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?
by: Pulipaka, Sidharth, et al.
Published: (2026)
Bench to the Future: A Pastcasting Benchmark for Forecasting Agents
by: FutureSearch, et al.
Published: (2025)
CloneMem: Benchmarking Long-Term Memory for AI Clones
by: Hu, Sen, et al.
Published: (2026)
AgentSearchBench: A Benchmark for AI Agent Search in the Wild
by: Wu, Bin, et al.
Published: (2026)
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
by: Anokhin, Petr, et al.
Published: (2025)
EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
by: Acuna, Julian
Published: (2026)
LifeAgentBench: A Multi-dimensional Benchmark and Agent for Personal Health Assistants in Digital Health
by: Tian, Ye, et al.
Published: (2026)
AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies
by: Ye, Xiao, et al.
Published: (2024)
NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
by: Zheng, Tianshi, et al.
Published: (2025)
MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
by: Im, Youngmin, et al.
Published: (2025)
ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents
by: Li, Chao, et al.
Published: (2026)
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
by: Zhou, Yifan, et al.
Published: (2026)