Bibliographic Details
Main Author:	Pandey, Mukund
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.01604

Similar Items

Evaluation Framework for AI Systems in "the Wild"
by: Jabbour, Sarah, et al.
Published: (2025)

Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation
by: Bhonsle, Roshita, et al.
Published: (2025)

From Failure Modes to Reliability Awareness in Generative and Agentic AI System
by: Janet, et al.
Published: (2025)

An Agentic Evaluation Framework for AI-Generated Scientific Code in PETSc
by: Zhang, Hong, et al.
Published: (2026)

Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems
by: Mehta, Sushant
Published: (2025)

Detecting Silent Failures in Multi-Agentic AI Trajectories
by: Pathak, Divya, et al.
Published: (2025)

Holistic Evaluation and Failure Diagnosis of AI Agents
by: Madvil, Netta, et al.
Published: (2026)

Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems
by: Akshathala, Sreemaee, et al.
Published: (2025)

A Unified Framework for the Evaluation of LLM Agentic Capabilities
by: Zhu, Pengyu, et al.
Published: (2026)

Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
by: Chhabra, Anshuman, et al.
Published: (2025)

RAIL in the Wild: Operationalizing Responsible AI Evaluation Using Anthropic's Value Dataset
by: Verma, Sumit, et al.
Published: (2025)

Creative Adversarial Testing (CAT): A Novel Framework for Evaluating Goal-Oriented Agentic AI Systems
by: Dhrif, Hassen
Published: (2025)

LightAgent: Production-level Open-source Agentic AI Framework
by: Cai, Weige, et al.
Published: (2025)

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries
by: Zhang, Xing, et al.
Published: (2026)

Zero-Direction Probing: A Linear-Algebraic Framework for Deep Analysis of Large-Language-Model Drift
by: Pandey, Amit
Published: (2025)

The Auton Agentic AI Framework
by: Cao, Sheng, et al.
Published: (2026)

Agentic Design Patterns: A System-Theoretic Framework
by: Dao, Minh-Dung, et al.
Published: (2026)

DAO-AI: Evaluating Collective Decision-Making through Agentic AI in Decentralized Governance
by: Capponi, Agostino, et al.
Published: (2025)

WildSpoof Challenge Evaluation Plan
by: Wu, Yihan, et al.
Published: (2025)

AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production
by: Kartik, NVJK, et al.
Published: (2025)

AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems
by: Lee, YenTing, et al.
Published: (2026)

Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering
by: Li, Jingyue, et al.
Published: (2026)

Proper Scoring Rules for Agentic Uncertainty Quantification
by: Raghu, Suresh, et al.
Published: (2026)

GuidelineGuard: An Agentic Framework for Medical Note Evaluation with Guideline Adherence
by: Shahriyear, MD Ragib
Published: (2024)

Control Plane as a Tool: A Scalable Design Pattern for Agentic AI Systems
by: Kandasamy, Sivasathivel
Published: (2025)

Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems
by: Shukla, Manish
Published: (2025)

Performant LLM Agentic Framework for Conversational AI
by: Casella, Alex, et al.
Published: (2025)

Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications
by: Vinay, Vaishali
Published: (2025)

Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier
by: Henry, Jazmia
Published: (2026)

A Conceptual Framework for AI Capability Evaluations
by: Carro, María Victoria, et al.
Published: (2025)

Evaluating Deepfake Detectors in the Wild
by: Pirogov, Viacheslav, et al.
Published: (2025)

Digital Twin and Agentic AI for Wild Fire Disaster Management: Intelligent Virtual Situation Room
by: Morsali, Mohammad, et al.
Published: (2026)

Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
by: van der Maden, Willem, et al.
Published: (2026)

Ethical AI: Towards Defining a Collective Evaluation Framework
by: Sharma, Aasish Kumar, et al.
Published: (2025)

DREAM: Deep Research Evaluation with Agentic Metrics
by: Avraham, Elad Ben, et al.
Published: (2026)

HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
by: Raza, Shaina, et al.
Published: (2025)

Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals
by: Menon, Achyutha, et al.
Published: (2026)

Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization
by: Blasberg, Alexander, et al.
Published: (2026)

Agentic AI Frameworks: Architectures, Protocols, and Design Challenges
by: Derouiche, Hana, et al.
Published: (2025)

CocoaBench: Evaluating Unified Digital Agents in the Wild
by: CocoaBench Team, et al.
Published: (2026)