Bibliographic Details
Main Authors: Soroka, Emi; Chopra, Tanmay; Desai, Krish; Lall, Sanjay
Format: Preprint
Published: 2025
Subjects: Machine Learning; I.2.7
Online Access: https://arxiv.org/abs/2511.03047
author Soroka, Emi
Chopra, Tanmay
Desai, Krish
Lall, Sanjay
contents Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.
format Preprint
id arxiv_https___arxiv_org_abs_2511_03047
institution arXiv
publishDate 2025
record_format arxiv
title Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions
topic Machine Learning
I.2.7
url https://arxiv.org/abs/2511.03047