Staff View: Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability :: Library Catalog

Bibliographic Details
Main Authors:	Raj, Harsh; Orkat, Niranjan; Mukherjee, Suvrorup; Guha, Aritra; Flynn, Cheryl; Majumdar, Subhabrata
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.10516

_version_	1866917480618786816
author	Raj, Harsh; Orkat, Niranjan; Mukherjee, Suvrorup; Guha, Aritra; Flynn, Cheryl; Majumdar, Subhabrata
author_facet	Raj, Harsh; Orkat, Niranjan; Mukherjee, Suvrorup; Guha, Aritra; Flynn, Cheryl; Majumdar, Subhabrata
contents	This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging $U$-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_10516
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability Raj, Harsh Orkat, Niranjan Mukherjee, Suvrorup Guha, Aritra Flynn, Cheryl Majumdar, Subhabrata Artificial Intelligence This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging $U$-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.
title	Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.10516

Similar Items