Bibliographic Details
Main Authors: Raj, Harsh, Orkat, Niranjan, Mukherjee, Suvrorup, Guha, Aritra, Flynn, Cheryl, Majumdar, Subhabrata
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2605.10516
author Raj, Harsh
Orkat, Niranjan
Mukherjee, Suvrorup
Guha, Aritra
Flynn, Cheryl
Majumdar, Subhabrata
contents This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging $U$-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.
format Preprint
id arxiv_https___arxiv_org_abs_2605_10516
institution arXiv
publishDate 2026
record_format arxiv
title Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
topic Artificial Intelligence
url https://arxiv.org/abs/2605.10516