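| Main Authors: | Herrera, Alejandro Breen; Sheth, Aayush; Xu, Steven G.; Zhan, Zhucheng; Wright, Charles; Yearwood, Marcus; Wei, Hongtai; Das, Sudeep; Nightingale, Danny; Watson, Meg; Pollnow V, Charles |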
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Artificial Intelligence; Computation and Language; Machine Learning |
| Online Access: | https://arxiv.org/abs/2603.03565 |
| _version_ | 1866915971433758720 |
|---|---|
| author | Herrera, Alejandro Breen; Sheth, Aayush; Xu, Steven G.; Zhan, Zhucheng; Wright, Charles; Yearwood, Marcus; Wei, Hongtai; Das, Sudeep; Nightingale, Danny; Watson, Meg; Pollnow V, Charles |
| contents | Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly coupled multi-agent systems. Grocery shopping further amplifies these difficulties, as user requests are often underspecified, highly preference-sensitive, and constrained by factors such as budget and inventory. In this paper, we present a practical blueprint for evaluating and optimizing conversational shopping assistants, illustrated through a production-scale AI grocery assistant. We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations. Building on this evaluation foundation, we investigate two complementary prompt-optimization strategies based on a SOTA prompt-optimizer called GEPA (Shao et al., 2025): (1) Sub-agent GEPA, which optimizes individual agent nodes against localized rubrics, and (2) MAMuT (Multi-Agent Multi-Turn) GEPA (Herrera et al., 2026), a novel system-level approach that jointly optimizes prompts across agents using multi-turn simulation and trajectory-level scoring. We release rubric templates and evaluation design guidance to support practitioners building production CSAs. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_03565 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants |
| topic | Artificial Intelligence; Computation and Language; Machine Learning |
| url | https://arxiv.org/abs/2603.03565 |