:: Library Catalog

Bibliographic Details
Main Authors:	Diks, Ian; Muralidharan, Harihara; Proctor, Tim; Workman, Kenny
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.28065

Similar Items

SpatialBench: Can Agents Analyze Real-World Spatial Biology Data?
by: Workman, Kenny, et al.
Published: (2025)

scBench: Evaluating AI Agents on Single-Cell RNA-seq Analysis
by: Workman, Kenny, et al.
Published: (2026)

DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
by: Zhang, Yinger, et al.
Published: (2026)

NaturalGAIA: A Verifiable Benchmark and Hierarchical Framework for Long-Horizon GUI Tasks
by: Zheng, Zihan, et al.
Published: (2025)

UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios
by: Luo, Haotian, et al.
Published: (2025)

Spatially Grounded Long-Horizon Task Planning in the Wild
by: Jung, Sehun, et al.
Published: (2026)

LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
by: Motwani, Sumeet Ramesh, et al.
Published: (2026)

RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
by: Zhang, Zijing, et al.
Published: (2025)

LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
by: Cheng, Zihao, et al.
Published: (2026)

AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks
by: Jiang, Tanqiu, et al.
Published: (2026)

KellyBench: A Benchmark for Long-Horizon Sequential Decision Making
by: Grady, Thomas, et al.
Published: (2026)

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue
by: Lin, Jingjie, et al.
Published: (2026)

MinePlanner: A Benchmark for Long-Horizon Planning in Large Minecraft Worlds
by: Hill, William, et al.
Published: (2023)

ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
by: Imajuku, Yuki, et al.
Published: (2025)

LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
by: Xiao, Zikai, et al.
Published: (2025)

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
by: Jiayang, Cheng, et al.
Published: (2026)

HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
by: Anokhin, Petr, et al.
Published: (2025)

When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents
by: Yan, Lu, et al.
Published: (2026)

Can LLM Agents Be CFOs? Benchmarking Long-Horizon Resource Allocation in an Uncertain Enterprise Environment
by: Han, Yi, et al.
Published: (2026)

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution
by: Zhu, Zilin, et al.
Published: (2026)

VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains
by: Li, Xuzhao, et al.
Published: (2025)

BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
by: Yang, Yue, et al.
Published: (2025)

HorizonBench: Long-Horizon Personalization with Evolving Preferences
by: Li, Shuyue Stella, et al.
Published: (2026)

TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios
by: Shen, Yuanzhe, et al.
Published: (2026)

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
by: Li, Junlong, et al.
Published: (2025)

λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics
by: Jaafar, Ahmed, et al.
Published: (2024)

SpatialMem: Metric-Aligned Long-Horizon Video Memory for Language Grounding and QA
by: Zheng, Xinyi, et al.
Published: (2026)

ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
by: Song, Yuanyi, et al.
Published: (2025)

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
by: Lu, Yijun, et al.
Published: (2026)

Can Lessons From Human Teams Be Applied to Multi-Agent Systems? The Role of Structure, Diversity, and Interaction Dynamics
by: Muralidharan, Rasika, et al.
Published: (2025)

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
by: Xie, Sixiong, et al.
Published: (2026)

LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios
by: Chen, Tianyu, et al.
Published: (2026)

Active Inference in Discrete State Spaces from First Principles
by: Kenny, Patrick
Published: (2025)

PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
by: Zhang, Ruizhi, et al.
Published: (2026)

HiMem: Hierarchical Long-Term Memory for LLM Long-Horizon Agents
by: Zhang, Ningning, et al.
Published: (2026)

Learning Long-Horizon Predictions for Quadrotor Dynamics
by: Rao, Pratyaksh Prabhav, et al.
Published: (2024)

STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks
by: Lobo, Elita, et al.
Published: (2026)

LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning
by: Pushkin, Denys, et al.
Published: (2026)

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
by: Kachwala, Zoher, et al.
Published: (2026)

CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations
by: Gao, Huan-ang, et al.
Published: (2025)