Staff View: Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions :: Library Catalog

Bibliographic Details
Main Authors:	Soroka, Emi; Chopra, Tanmay; Desai, Krish; Lall, Sanjay
Format:	Preprint
Published:	2025
Subjects:	Machine Learning I.2.7
Online Access:	https://arxiv.org/abs/2511.03047

_version_	1866914137324388352
author	Soroka, Emi; Chopra, Tanmay; Desai, Krish; Lall, Sanjay
author_facet	Soroka, Emi; Chopra, Tanmay; Desai, Krish; Lall, Sanjay
contents	Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously-undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_03047
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions Soroka, Emi Chopra, Tanmay Desai, Krish Lall, Sanjay Machine Learning I.2.7 Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously-undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.
title	Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions
topic	Machine Learning I.2.7
url	https://arxiv.org/abs/2511.03047

Similar Items