Bibliographic Details
Main Authors: Herrera, Alejandro Breen; Sheth, Aayush; Xu, Steven G.; Zhan, Zhucheng; Wright, Charles; Yearwood, Marcus; Wei, Hongtai; Das, Sudeep; Nightingale, Danny; Watson, Meg; Pollnow V, Charles
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence; Computation and Language; Machine Learning
Online Access: https://arxiv.org/abs/2603.03565
author Herrera, Alejandro Breen
Sheth, Aayush
Xu, Steven G.
Zhan, Zhucheng
Wright, Charles
Yearwood, Marcus
Wei, Hongtai
Das, Sudeep
Nightingale, Danny
Watson, Meg
Pollnow V, Charles
contents Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly coupled multi-agent systems. Grocery shopping further amplifies these difficulties, as user requests are often underspecified, highly preference-sensitive, and constrained by factors such as budget and inventory. In this paper, we present a practical blueprint for evaluating and optimizing conversational shopping assistants, illustrated through a production-scale AI grocery assistant. We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations. Building on this evaluation foundation, we investigate two complementary prompt-optimization strategies based on the state-of-the-art prompt optimizer GEPA (Shao et al., 2025): (1) Sub-agent GEPA, which optimizes individual agent nodes against localized rubrics, and (2) MAMuT (Multi-Agent Multi-Turn) GEPA (Herrera et al., 2026), a novel system-level approach that jointly optimizes prompts across agents using multi-turn simulation and trajectory-level scoring. We release rubric templates and evaluation design guidance to support practitioners building production CSAs.
format Preprint
id arxiv_https___arxiv_org_abs_2603_03565
institution arXiv
publishDate 2026
record_format arxiv
title Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
topic Artificial Intelligence
Computation and Language
Machine Learning
url https://arxiv.org/abs/2603.03565