Bibliographic Details
Main Authors: Moshkovich, Dany, Zeltyn, Sergey
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2507.11277
author Moshkovich, Dany
Zeltyn, Sergey
contents Large Language Models (LLMs) are increasingly deployed within agentic systems - collections of interacting, LLM-powered agents that execute complex, adaptive workflows using memory, tools, and dynamic planning. While enabling powerful new capabilities, these systems also introduce unique forms of uncertainty stemming from probabilistic reasoning, evolving memory states, and fluid execution paths. Traditional software observability and operations practices fall short in addressing these challenges. This paper presents our vision of AgentOps: a comprehensive framework for observing, analyzing, optimizing, and automating operation of agentic AI systems. We identify distinct needs across four key roles - developers, testers, site reliability engineers (SREs), and business users - each of whom engages with the system at different points in its lifecycle. We present the AgentOps Automation Pipeline, a six-stage process encompassing behavior observation, metric collection, issue detection, root cause analysis, optimized recommendations, and runtime automation. Throughout, we emphasize the critical role of automation in managing uncertainty and enabling self-improving AI systems - not by eliminating uncertainty, but by taming it to ensure safe, adaptive, and effective operation.
format Preprint
id arxiv_https___arxiv_org_abs_2507_11277
institution arXiv
publishDate 2025
record_format arxiv
title Taming Uncertainty via Automation: Observing, Analyzing, and Optimizing Agentic AI Systems
topic Artificial Intelligence
Multiagent Systems
url https://arxiv.org/abs/2507.11277