| Main Authors: | Holmes, Matthew, Lacerda, Thiago, Schwartz, Reva |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.06811 |
Similar Items
Real-World AI Evaluation: How FRAME Generates Systematic Evidence to Resolve the Decision-Maker's Dilemma
by: Schwartz, Reva, et al.
Published: (2026)
CIRCLE: A Framework for Evaluating AI from a Real-World Lens
by: Schwartz, Reva, et al.
Published: (2026)
Reality Check: A New Evaluation Ecosystem Is Necessary to Understand AI's Real World Effects
by: Schwartz, Reva, et al.
Published: (2025)
Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts
by: Kryshtal, Andrii
Published: (2026)
Generation, Evaluation, and Explanation of Novelists' Styles with Single-Token Prompts
by: Rezaei, Mosab, et al.
Published: (2025)
Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone
by: Vishwarupe, Varad, et al.
Published: (2026)
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
by: Griffin, Charlie, et al.
Published: (2024)
Dynamic Context-Aware Prompt Recommendation for Domain-Specific AI Applications
by: Tang, Xinye, et al.
Published: (2025)
12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation
by: Ersoz, Ahmet Bahaddin
Published: (2026)
Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification
by: Wilson, Sarah, et al.
Published: (2026)
Internal Deployment Gaps in AI Regulation
by: Kwon, Joe, et al.
Published: (2026)
Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement
by: Gallego, Víctor
Published: (2025)
Bridging Protocol and Production: Design Patterns for Deploying AI Agents with Model Context Protocol
by: Srinivasan, Vasundra
Published: (2026)
Enhancing Multi-Agent Communication through Attention Steering with Context Relevance
by: Zhang, Hongxiang, et al.
Published: (2026)
A Field Guide to Deploying AI Agents in Clinical Practice
by: Gallifant, Jack, et al.
Published: (2025)
Solving Context Window Overflow in AI Agents
by: Labate, Anton Bulle, et al.
Published: (2025)
DAO-AI: Evaluating Collective Decision-Making through Agentic AI in Decentralized Governance
by: Capponi, Agostino, et al.
Published: (2025)
RAN Cortex: Memory-Augmented Intelligence for Context-Aware Decision-Making in AI-Native Networks
by: Barros, Sebastian
Published: (2025)
Monitoring Deployed AI Systems in Health Care
by: Keyes, Timothy, et al.
Published: (2025)
Relevance-driven Decision Making for Safer and More Efficient Human Robot Collaboration
by: Zhang, Xiaotong, et al.
Published: (2024)
Responsible Evaluation of AI for Mental Health
by: Arnaout, Hiba, et al.
Published: (2026)
PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents
by: Arman, Shifat E., et al.
Published: (2026)
Safety Must Precede the Deployment of Open-Ended AI
by: Sheth, Ivaxi, et al.
Published: (2025)
Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning
by: Linze, Chen, et al.
Published: (2026)
Interactive AI Alignment: Specification, Process, and Evaluation Alignment
by: Terry, Michael, et al.
Published: (2023)
Can AI Make Energy Retrofit Decisions? An Evaluation of Large Language Models
by: Shu, Lei, et al.
Published: (2025)
Ask What Your Country Can Do For You: Towards a Public Red Teaming Model
by: Kennedy, Wm. Matthew, et al.
Published: (2025)
The Ethics of AI in Education
by: Porayska-Pomsta, Kaska, et al.
Published: (2024)
CATCODER: Repository-Level Code Generation with Relevant Code and Type Context
by: Pan, Zhiyuan, et al.
Published: (2024)
Real-world Deployment and Evaluation of PErioperative AI CHatbot (PEACH) -- a Large Language Model Chatbot for Perioperative Medicine
by: Ke, Yu He, et al.
Published: (2024)
AI2Agent: An End-to-End Framework for Deploying AI Projects as Autonomous Agents
by: Chen, Jiaxiang, et al.
Published: (2025)
CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models
by: Le, Yifan, et al.
Published: (2026)
In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
by: Liu, Sheng, et al.
Published: (2023)
The Deployment Gap in AI Media Detection: Platform-Aware and Visually Constrained Adversarial Evaluation
by: Budhkar, Aishwarya, et al.
Published: (2026)
Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs
by: Tian, Runchu, et al.
Published: (2024)
Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities
by: Bertsch, Amanda, et al.
Published: (2025)
Contextual Moral Value Alignment Through Context-Based Aggregation
by: Dognin, Pierre, et al.
Published: (2024)
XChoice: Explainable Evaluation of AI-Human Alignment in LLM-based Constrained Choice Decision Making
by: Qi, Weihong, et al.
Published: (2026)
Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents
by: Meng, M.
Published: (2026)
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
by: Arrieta, Aitor, et al.
Published: (2025)