| Main Author: | Pandey, Mukund |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.01604 |
Similar Items
Evaluation Framework for AI Systems in "the Wild"
by: Jabbour, Sarah, et al.
Published: (2025)
Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation
by: Bhonsle, Roshita, et al.
Published: (2025)
From Failure Modes to Reliability Awareness in Generative and Agentic AI System
by: Janet, et al.
Published: (2025)
An Agentic Evaluation Framework for AI-Generated Scientific Code in PETSc
by: Zhang, Hong, et al.
Published: (2026)
Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems
by: Mehta, Sushant
Published: (2025)
Detecting Silent Failures in Multi-Agentic AI Trajectories
by: Pathak, Divya, et al.
Published: (2025)
Holistic Evaluation and Failure Diagnosis of AI Agents
by: Madvil, Netta, et al.
Published: (2026)
Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems
by: Akshathala, Sreemaee, et al.
Published: (2025)
A Unified Framework for the Evaluation of LLM Agentic Capabilities
by: Zhu, Pengyu, et al.
Published: (2026)
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
by: Chhabra, Anshuman, et al.
Published: (2025)
RAIL in the Wild: Operationalizing Responsible AI Evaluation Using Anthropic's Value Dataset
by: Verma, Sumit, et al.
Published: (2025)
Creative Adversarial Testing (CAT): A Novel Framework for Evaluating Goal-Oriented Agentic AI Systems
by: Dhrif, Hassen
Published: (2025)
LightAgent: Production-level Open-source Agentic AI Framework
by: Cai, Weige, et al.
Published: (2025)
Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries
by: Zhang, Xing, et al.
Published: (2026)
Zero-Direction Probing: A Linear-Algebraic Framework for Deep Analysis of Large-Language-Model Drift
by: Pandey, Amit
Published: (2025)
The Auton Agentic AI Framework
by: Cao, Sheng, et al.
Published: (2026)
Agentic Design Patterns: A System-Theoretic Framework
by: Dao, Minh-Dung, et al.
Published: (2026)
DAO-AI: Evaluating Collective Decision-Making through Agentic AI in Decentralized Governance
by: Capponi, Agostino, et al.
Published: (2025)
WildSpoof Challenge Evaluation Plan
by: Wu, Yihan, et al.
Published: (2025)
AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production
by: Kartik, NVJK, et al.
Published: (2025)
AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems
by: Lee, YenTing, et al.
Published: (2026)
Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering
by: Li, Jingyue, et al.
Published: (2026)
Proper Scoring Rules for Agentic Uncertainty Quantification
by: Raghu, Suresh, et al.
Published: (2026)
GuidelineGuard: An Agentic Framework for Medical Note Evaluation with Guideline Adherence
by: Shahriyear, MD Ragib
Published: (2024)
Control Plane as a Tool: A Scalable Design Pattern for Agentic AI Systems
by: Kandasamy, Sivasathivel
Published: (2025)
Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems
by: Shukla, Manish
Published: (2025)
Performant LLM Agentic Framework for Conversational AI
by: Casella, Alex, et al.
Published: (2025)
Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications
by: Vinay, Vaishali
Published: (2025)
Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier
by: Henry, Jazmia
Published: (2026)
A Conceptual Framework for AI Capability Evaluations
by: Carro, María Victoria, et al.
Published: (2025)
Evaluating Deepfake Detectors in the Wild
by: Pirogov, Viacheslav, et al.
Published: (2025)
Digital Twin and Agentic AI for Wild Fire Disaster Management: Intelligent Virtual Situation Room
by: Morsali, Mohammad, et al.
Published: (2026)
Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
by: van der Maden, Willem, et al.
Published: (2026)
Ethical AI: Towards Defining a Collective Evaluation Framework
by: Sharma, Aasish Kumar, et al.
Published: (2025)
DREAM: Deep Research Evaluation with Agentic Metrics
by: Avraham, Elad Ben, et al.
Published: (2026)
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
by: Raza, Shaina, et al.
Published: (2025)
Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals
by: Menon, Achyutha, et al.
Published: (2026)
Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization
by: Blasberg, Alexander, et al.
Published: (2026)
Agentic AI Frameworks: Architectures, Protocols, and Design Challenges
by: Derouiche, Hana, et al.
Published: (2025)
CocoaBench: Evaluating Unified Digital Agents in the Wild
by: CocoaBench Team, et al.
Published: (2026)