Bibliographic Details
Main Authors:	Soroka, Emi; Chopra, Tanmay; Desai, Krish; Lall, Sanjay
Format:	Preprint
Published:	2025
Subjects:	Machine Learning I.2.7
Online Access:	https://arxiv.org/abs/2511.03047

Table of Contents:

Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.

Similar Items