Library Catalog

Bibliographic Details
Main Authors:	Backlund, Axel; Petersson, Lukas
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.15840

Similar Items

Blueprint-Bench: Comparing spatial intelligence of LLMs, agents and image models
by: Petersson, Lukas, et al.
Published: (2025)

Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence
by: Sharrock, Callum, et al.
Published: (2025)

Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents
by: Shen, Yiting, et al.
Published: (2026)

YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
by: He, Muyu, et al.
Published: (2026)

VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents
by: Chen, Yuhao, et al.
Published: (2026)

StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns
by: Wan, Luanbo, et al.
Published: (2025)

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
by: Li, Keyu, et al.
Published: (2026)

TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios
by: Shen, Yuanzhe, et al.
Published: (2026)

LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
by: Qiu, Jielin, et al.
Published: (2025)

VitalBench: A Rigorous Multi-Center Benchmark for Long-Term Vital Sign Prediction in Intraoperative Care
by: Cai, Xiuding, et al.
Published: (2025)

DCA-Bench: A Benchmark for Dataset Curation Agents
by: Huang, Benhao, et al.
Published: (2024)

MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models
by: Tang, Zecheng, et al.
Published: (2026)

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
by: Dong, Haonan, et al.
Published: (2026)

LongGenBench: Long-context Generation Benchmark
by: Liu, Xiang, et al.
Published: (2024)

Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents
by: Bei, Yuanchen, et al.
Published: (2026)

PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
by: Nie, Hongyi, et al.
Published: (2026)

LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
by: Cheng, Zihao, et al.
Published: (2026)

KellyBench: A Benchmark for Long-Horizon Sequential Decision Making
by: Grady, Thomas, et al.
Published: (2026)

RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments
by: Zhang, Linghua, et al.
Published: (2026)

ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
by: Song, Yuanyi, et al.
Published: (2025)

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
by: Levy, Ido, et al.
Published: (2024)

ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support
by: Chen, Tiantian, et al.
Published: (2026)

SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
by: Chen, Jingxuan, et al.
Published: (2024)

ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
by: Imajuku, Yuki, et al.
Published: (2025)

ProBench: Benchmarking GUI Agents with Accurate Process Information
by: Yang, Leyang, et al.
Published: (2025)

SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval
by: Li, Ningyuan, et al.
Published: (2026)

LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios
by: Chen, Tianyu, et al.
Published: (2026)

GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis
by: Yu, Bo, et al.
Published: (2026)

PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?
by: Pulipaka, Sidharth, et al.
Published: (2026)

Bench to the Future: A Pastcasting Benchmark for Forecasting Agents
by: FutureSearch, et al.
Published: (2025)

CloneMem: Benchmarking Long-Term Memory for AI Clones
by: Hu, Sen, et al.
Published: (2026)

AgentSearchBench: A Benchmark for AI Agent Search in the Wild
by: Wu, Bin, et al.
Published: (2026)

HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
by: Anokhin, Petr, et al.
Published: (2025)

EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
by: Acuna, Julian
Published: (2026)

LifeAgentBench: A Multi-dimensional Benchmark and Agent for Personal Health Assistants in Digital Health
by: Tian, Ye, et al.
Published: (2026)

AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies
by: Ye, Xiao, et al.
Published: (2024)

NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
by: Zheng, Tianshi, et al.
Published: (2025)

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
by: Im, Youngmin, et al.
Published: (2025)

ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents
by: Li, Chao, et al.
Published: (2026)

SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
by: Zhou, Yifan, et al.
Published: (2026)