Staff View: Why Do Multi-Agent LLM Systems Fail? :: Library Catalog

Bibliographic Details
Main Authors:	Cemri, Mert; Pan, Melissa Z.; Yang, Shuyi; Agrawal, Lakshya A.; Chopra, Bhavya; Tiwari, Rishabh; Keutzer, Kurt; Parameswaran, Aditya; Klein, Dan; Ramchandran, Kannan; Zaharia, Matei; Gonzalez, Joseph E.; Stoica, Ion
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2503.13657

_version_	1866909870371897344
author	Cemri, Mert; Pan, Melissa Z.; Yang, Shuyi; Agrawal, Lakshya A.; Chopra, Bhavya; Tiwari, Rishabh; Keutzer, Kurt; Parameswaran, Aditya; Klein, Dan; Ramchandran, Kannan; Zaharia, Matei; Gonzalez, Joseph E.; Stoica, Ion
author_facet	Cemri, Mert; Pan, Melissa Z.; Yang, Shuyi; Agrawal, Lakshya A.; Chopra, Bhavya; Tiwari, Rishabh; Keutzer, Kurt; Parameswaran, Aditya; Klein, Dan; Ramchandran, Kannan; Zaharia, Matei; Gonzalez, Joseph E.; Stoica, Ion
contents	Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (kappa = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_13657
institution	arXiv
publishDate	2025
record_format	arxiv
title	Why Do Multi-Agent LLM Systems Fail?
topic	Artificial Intelligence
url	https://arxiv.org/abs/2503.13657

Similar Items