Staff View: What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment :: Library Catalog

Bibliographic Details
Main Authors:	Jia, Allison Sihan; Huang, Daniel; Vytla, Nikhil; Yoo, Seung Won Wilson; Choudhury, Nirvika; Sen, Shayak; Mitchell, John C.; Datta, Anupam
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Multiagent Systems
Online Access:	https://arxiv.org/abs/2510.08847

_version_	1866917365221949440
author	Jia, Allison Sihan; Huang, Daniel; Vytla, Nikhil; Yoo, Seung Won Wilson; Choudhury, Nirvika; Sen, Shayak; Mitchell, John C.; Datta, Anupam
author_facet	Jia, Allison Sihan; Huang, Daniel; Vytla, Nikhil; Yoo, Seung Won Wilson; Choudhury, Nirvika; Sen, Shayak; Mitchell, John C.; Datta, Anupam
contents	We introduce the Agent GPA (Goal-Plan-Action) framework, driven by the fundamental insight that critical agent failures emerge at the intersections of setting goals, devising plans, and executing actions. We operationalize the framework with a factorized suite of LLM judges designed to measure distinct elements of Goal-Plan-Act alignment. To make this methodology scalable and generalizable across diverse agent architectures and datasets, we use state-of-the-art automated prompt optimization techniques to systematically generate domain-specific evaluation criteria. We validate this approach across three benchmarks: a multi-agent research setting (TRAIL/GAIA), a single coding agent setting (TRAIL/SWE-bench), and a private, enterprise data-agent setting (Snowflake Intelligence). Extensive evaluation on TRAIL/GAIA demonstrates the core validity of the framework, which identifies a broad range of agent failures (95% of human-annotated errors), localizes errors to enable targeted debugging (86% of human-annotated errors), and exhibits strong agreement with human evaluators. Crucially, by applying our automated methodology to both public datasets, we demonstrate that our GPA judges generally achieve the highest error coverage (ranging from 76% to 86%) in comparison to manual prompting approaches. We also leverage an evolutionary coding agent to improve judge consistency by up to 38% through iterative refinement of evaluation rubrics. Overall, Agent GPA provides a rigorous and generalizable paradigm for targeted agent evaluation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_08847
institution	arXiv
publishDate	2025
record_format	arxiv
title	What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment
topic	Artificial Intelligence; Multiagent Systems
url	https://arxiv.org/abs/2510.08847

Similar Items