Bibliographic Details
Main Authors: Cemri, Mert, Pan, Melissa Z., Yang, Shuyi, Agrawal, Lakshya A., Chopra, Bhavya, Tiwari, Rishabh, Keutzer, Kurt, Parameswaran, Aditya, Klein, Dan, Ramchandran, Kannan, Zaharia, Matei, Gonzalez, Joseph E., Stoica, Ion
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2503.13657
_version_ 1866909870371897344
author Cemri, Mert
Pan, Melissa Z.
Yang, Shuyi
Agrawal, Lakshya A.
Chopra, Bhavya
Tiwari, Rishabh
Keutzer, Kurt
Parameswaran, Aditya
Klein, Dan
Ramchandran, Kannan
Zaharia, Matei
Gonzalez, Joseph E.
Stoica, Ion
contents Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail, which in turn requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS, with the aim of guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (kappa = 0.88). This process identifies 14 unique failure modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline that shows high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT-4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating headroom for improvement from better MAS design. Our analysis shows that the identified failures require more sophisticated solutions, which points to a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST taxonomy, and our LLM annotator to facilitate widespread research and development in MAS.
format Preprint
id arxiv_https___arxiv_org_abs_2503_13657
institution arXiv
publishDate 2025
record_format arxiv
title Why Do Multi-Agent LLM Systems Fail?
topic Artificial Intelligence
url https://arxiv.org/abs/2503.13657