Bibliographic Details
Main Authors: Chen, Xudong, Liu, Yixin, Wei, Hua, Ding, Kaize
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2605.14483
author Chen, Xudong
Liu, Yixin
Wei, Hua
Ding, Kaize
contents Large language models (LLMs) have become a strong foundation for multi-agent systems, but their effectiveness depends heavily on orchestration design. Across different tasks, role design, capacity assignment, and dependency construction jointly affect both solution quality and execution efficiency. Existing approaches automate parts of this design process, yet they often optimize these decisions partially or sequentially, and rely on execution-level feedback that provides limited credit assignment for local orchestration decisions. We propose LEMON (Learning Executable Multi-agent OrchestratioN via Counterfactual Reinforcement Learning), an LLM-based orchestrator that generates an executable orchestration specification. The specification integrates task-specific roles, customized duties, capacity levels, and dependency structure into a single deployable system. To train the orchestrator, we augment the orchestration-level GRPO objective with a localized counterfactual signal that edits role, capacity, or dependency fields and applies the resulting reward contrast only to the edited spans. Experiments on six reasoning and coding benchmarks, including MMLU, GSM8K, AQuA, MultiArith, SVAMP, and HumanEval, show that LEMON achieves state-of-the-art performance among the evaluated multi-agent orchestration methods. Our code is available at https://anonymous.4open.science/r/LEMON-B23C.
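The abstract describes the training signal only in words, so the following is a minimal illustrative sketch of one way to read it, not the authors' implementation. It assumes that each sampled orchestration receives a GRPO-style group-normalized advantage, and that the reward difference between an orchestration and its counterfactually edited variant is added only to the tokens of the edited field. The function names, the `weight` parameter, and the half-open `(start, end)` span convention are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its sampled group.

    Assumption: population standard deviation with a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(num_tokens, base_advantage, edited_span,
                     reward_original, reward_counterfactual, weight=1.0):
    """Per-token advantages with a localized counterfactual contrast.

    Every token receives the orchestration-level advantage. Tokens inside
    edited_span (hypothetical half-open [start, end) indices of the edited
    role, capacity, or dependency field) additionally receive the reward
    contrast between the original and the edited specification.
    """
    start, end = edited_span
    contrast = reward_original - reward_counterfactual
    advantages = [base_advantage] * num_tokens
    for i in range(start, end):
        advantages[i] += weight * contrast
    return advantages


if __name__ == "__main__":
    # Two sampled orchestrations for one task, with made-up rewards.
    group = group_relative_advantages([1.0, 0.0])
    # The first sample has 5 tokens; tokens 1..2 encode the edited field.
    print(token_advantages(5, group[0], (1, 3), 1.0, 0.0))
```

Under these assumptions, tokens outside the edited span are credited only through the group-level advantage, while the edited span also carries the local contrast. This is the sense in which the counterfactual signal offers finer credit assignment than execution-level feedback alone.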
format Preprint
id arxiv_https___arxiv_org_abs_2605_14483
institution arXiv
publishDate 2026
record_format arxiv
title LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning
topic Artificial Intelligence
url https://arxiv.org/abs/2605.14483