Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Wang, Weixuan, Han, Dongge, Diaz, Daniel Madrigal, Xu, Jin, Rühle, Victor, Rajmohan, Saravan
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2508.09124
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866916894020206592
author	Wang, Weixuan Han, Dongge Diaz, Daniel Madrigal Xu, Jin Rühle, Victor Rajmohan, Saravan
author_facet	Wang, Weixuan Han, Dongge Diaz, Daniel Madrigal Xu, Jin Rühle, Victor Rajmohan, Saravan
contents	Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing more accurate assessment of their capabilities in complex, real-world contexts compared to existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_09124
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows Wang, Weixuan Han, Dongge Diaz, Daniel Madrigal Xu, Jin Rühle, Victor Rajmohan, Saravan Computation and Language Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing more accurate assessment of their capabilities in complex, real-world contexts compared to existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.
title	OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
topic	Computation and Language
url	https://arxiv.org/abs/2508.09124

Similar Items