Staff View: M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities? :: Library Catalog

Bibliographic Details
Main Authors:	Li, Ao; Zhang, Jinghui; Li, Luyu; Duan, Yuxiang; Gao, Lang; Chen, Mingcai; Qin, Weijun; Li, Shaopeng; Ji, Fengxian; Liu, Ning; Cui, Lizhen; Chen, Xiuying; Du, Yuntao
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.02854

_version_	1866912804871602176
author	Li, Ao; Zhang, Jinghui; Li, Luyu; Duan, Yuxiang; Gao, Lang; Chen, Mingcai; Qin, Weijun; Li, Shaopeng; Ji, Fengxian; Liu, Ning; Cui, Lizhen; Chen, Xiuying; Du, Yuntao
author_facet	Li, Ao; Zhang, Jinghui; Li, Luyu; Duan, Yuxiang; Gao, Lang; Chen, Mingcai; Qin, Weijun; Li, Shaopeng; Ji, Fengxian; Liu, Ning; Cui, Lizhen; Chen, Xiuying; Du, Yuntao
contents	As an agent-level reasoning and coordination paradigm, Multi-Agent Debate (MAD) orchestrates multiple agents through structured debate to improve answer quality and support complex reasoning. However, existing research on MAD suffers from two fundamental limitations: evaluations are conducted under fragmented and inconsistent settings, hindering fair comparison, and are largely restricted to single-modality scenarios that rely on textual inputs only. To address these gaps, we introduce M3MAD-Bench, a unified and extensible benchmark for evaluating MAD methods across Multi-domain tasks, Multi-modal inputs, and Multi-dimensional metrics. M3MAD-Bench establishes standardized protocols over five core task domains: Knowledge, Mathematics, Medicine, Natural Sciences, and Complex Reasoning, and systematically covers both pure text and vision-language datasets, enabling controlled cross-modality comparison. We evaluate MAD methods on nine base models spanning different architectures, scales, and modality capabilities. Beyond accuracy, M3MAD-Bench incorporates efficiency-oriented metrics such as token consumption and inference time, providing a holistic view of performance-cost trade-offs. Extensive experiments yield systematic insights into the effectiveness, robustness, and efficiency of MAD across text-only and multimodal scenarios. We believe M3MAD-Bench offers a reliable foundation for future research on standardized MAD evaluation. The code is available at http://github.com/liaolea/M3MAD-Bench.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_02854
institution	arXiv
publishDate	2026
record_format	arxiv
title	M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?
topic	Artificial Intelligence
url	https://arxiv.org/abs/2601.02854

Similar Items