Bibliographic Details
Main Authors: Jia, Allison Sihan, Huang, Daniel, Vytla, Nikhil, Yoo, Seung Won Wilson, Choudhury, Nirvika, Sen, Shayak, Mitchell, John C., Datta, Anupam
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence, Multiagent Systems
Online Access: https://arxiv.org/abs/2510.08847
author Jia, Allison Sihan
Huang, Daniel
Vytla, Nikhil
Yoo, Seung Won Wilson
Choudhury, Nirvika
Sen, Shayak
Mitchell, John C.
Datta, Anupam
contents We introduce the Agent GPA (Goal-Plan-Action) framework, driven by the fundamental insight that critical agent failures emerge at the intersections of setting goals, devising plans, and executing actions. We operationalize the framework with a factorized suite of LLM judges designed to measure distinct elements of Goal-Plan-Action alignment. To make this methodology scalable and generalizable across diverse agent architectures and datasets, we use state-of-the-art automated prompt optimization techniques to systematically generate domain-specific evaluation criteria. We validate this approach across three benchmarks: a multi-agent research setting (TRAIL/GAIA), a single coding agent setting (TRAIL/SWE-bench), and a private enterprise data-agent setting (Snowflake Intelligence). Extensive evaluation on TRAIL/GAIA demonstrates the core validity of the framework, which identifies a broad range of agent failures (95% of human-annotated errors), localizes errors to enable targeted debugging (86% of human-annotated errors), and exhibits strong agreement with human evaluators. Crucially, by applying our automated methodology to both public datasets, we demonstrate that our GPA judges generally achieve the highest error coverage (ranging from 76% to 86%) in comparison to manual prompting approaches. We also leverage an evolutionary coding agent to improve judge consistency by up to 38% through iterative refinement of evaluation rubrics. Overall, Agent GPA provides a rigorous and generalizable paradigm for targeted agent evaluation.
format Preprint
id arxiv_https___arxiv_org_abs_2510_08847
institution arXiv
publishDate 2025
record_format arxiv
title What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment
topic Artificial Intelligence
Multiagent Systems
url https://arxiv.org/abs/2510.08847