| Main Authors: | Diks, Ian, Muralidharan, Harihara, Proctor, Tim, Workman, Kenny |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.28065 |
Similar Items
SpatialBench: Can Agents Analyze Real-World Spatial Biology Data?
by: Workman, Kenny, et al.
Published: (2025)
scBench: Evaluating AI Agents on Single-Cell RNA-seq Analysis
by: Workman, Kenny, et al.
Published: (2026)
DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
by: Zhang, Yinger, et al.
Published: (2026)
NaturalGAIA: A Verifiable Benchmark and Hierarchical Framework for Long-Horizon GUI Tasks
by: Zheng, Zihan, et al.
Published: (2025)
UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios
by: Luo, Haotian, et al.
Published: (2025)
Spatially Grounded Long-Horizon Task Planning in the Wild
by: Jung, Sehun, et al.
Published: (2026)
LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
by: Motwani, Sumeet Ramesh, et al.
Published: (2026)
RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
by: Zhang, Zijing, et al.
Published: (2025)
LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
by: Cheng, Zihao, et al.
Published: (2026)
AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks
by: Jiang, Tanqiu, et al.
Published: (2026)
KellyBench: A Benchmark for Long-Horizon Sequential Decision Making
by: Grady, Thomas, et al.
Published: (2026)
Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue
by: Lin, Jingjie, et al.
Published: (2026)
MinePlanner: A Benchmark for Long-Horizon Planning in Large Minecraft Worlds
by: Hill, William, et al.
Published: (2023)
ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
by: Imajuku, Yuki, et al.
Published: (2025)
LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
by: Xiao, Zikai, et al.
Published: (2025)
AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
by: Jiayang, Cheng, et al.
Published: (2026)
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
by: Anokhin, Petr, et al.
Published: (2025)
When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents
by: Yan, Lu, et al.
Published: (2026)
Can LLM Agents Be CFOs? Benchmarking Long-Horizon Resource Allocation in an Uncertain Enterprise Environment
by: Han, Yi, et al.
Published: (2026)
When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution
by: Zhu, Zilin, et al.
Published: (2026)
VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains
by: Li, Xuzhao, et al.
Published: (2025)
BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
by: Yang, Yue, et al.
Published: (2025)
HorizonBench: Long-Horizon Personalization with Evolving Preferences
by: Li, Shuyue Stella, et al.
Published: (2026)
TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios
by: Shen, Yuanzhe, et al.
Published: (2026)
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
by: Li, Junlong, et al.
Published: (2025)
λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics
by: Jaafar, Ahmed, et al.
Published: (2024)
SpatialMem: Metric-Aligned Long-Horizon Video Memory for Language Grounding and QA
by: Zheng, Xinyi, et al.
Published: (2026)
ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
by: Song, Yuanyi, et al.
Published: (2025)
LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
by: Lu, Yijun, et al.
Published: (2026)
Can Lessons From Human Teams Be Applied to Multi-Agent Systems? The Role of Structure, Diversity, and Interaction Dynamics
by: Muralidharan, Rasika, et al.
Published: (2025)
DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
by: Xie, Sixiong, et al.
Published: (2026)
LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios
by: Chen, Tianyu, et al.
Published: (2026)
Active Inference in Discrete State Spaces from First Principles
by: Kenny, Patrick
Published: (2025)
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
by: Zhang, Ruizhi, et al.
Published: (2026)
HiMem: Hierarchical Long-Term Memory for LLM Long-Horizon Agents
by: Zhang, Ningning, et al.
Published: (2026)
Learning Long-Horizon Predictions for Quadrotor Dynamics
by: Rao, Pratyaksh Prabhav, et al.
Published: (2024)
STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks
by: Lobo, Elita, et al.
Published: (2026)
LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning
by: Pushkin, Denys, et al.
Published: (2026)
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
by: Kachwala, Zoher, et al.
Published: (2026)
CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations
by: Gao, Huan-ang, et al.
Published: (2025)