Saved in:
Bibliographic Details
Main Authors: Wang, Weixuan, Han, Dongge, Diaz, Daniel Madrigal, Xu, Jin, Rühle, Victor, Rajmohan, Saravan
Format: Preprint
Published: 2025
Subjects:
Online Access:https://arxiv.org/abs/2508.09124
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866916894020206592
author Wang, Weixuan
Han, Dongge
Diaz, Daniel Madrigal
Xu, Jin
Rühle, Victor
Rajmohan, Saravan
author_facet Wang, Weixuan
Han, Dongge
Diaz, Daniel Madrigal
Xu, Jin
Rühle, Victor
Rajmohan, Saravan
contents Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing more accurate assessment of their capabilities in complex, real-world contexts compared to existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.
format Preprint
id arxiv_https___arxiv_org_abs_2508_09124
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
Wang, Weixuan
Han, Dongge
Diaz, Daniel Madrigal
Xu, Jin
Rühle, Victor
Rajmohan, Saravan
Computation and Language
Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing more accurate assessment of their capabilities in complex, real-world contexts compared to existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.
title OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
topic Computation and Language
url https://arxiv.org/abs/2508.09124