Staff View: Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms

Bibliographic Details
Main Authors:	Andres, Miguel E.; Fedorov, Vadim; Sadek, Rida; Spagnolo-Arrizabalaga, Enric; Trudel, Nadescha
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.04133

_version_	1866908762942472192
author	Andres, Miguel E.; Fedorov, Vadim; Sadek, Rida; Spagnolo-Arrizabalaga, Enric; Trudel, Nadescha
author_facet	Andres, Miguel E.; Fedorov, Vadim; Sadek, Rida; Spagnolo-Arrizabalaga, Enric; Trudel, Nadescha
contents	Voice AI agents are rapidly transitioning to production deployments, yet systematic methods for ensuring testing reliability remain underdeveloped. Organizations cannot objectively assess whether their testing approaches (internal tools or external platforms) actually work, creating a critical measurement gap as voice AI scales to billions of daily interactions. We present the first systematic framework for evaluating voice AI testing quality through human-centered benchmarking. Our methodology addresses the fundamental dual challenge of testing platforms: generating realistic test conversations (simulation quality) and accurately evaluating agent responses (evaluation quality). The framework combines established psychometric techniques (pairwise comparisons yielding Elo ratings, bootstrap confidence intervals, and permutation tests) with rigorous statistical validation to provide reproducible metrics applicable to any testing approach. To validate the framework and demonstrate its utility, we conducted a comprehensive empirical evaluation of three leading commercial voice AI testing platforms, using 21,600 human judgments across 45 simulations and ground-truth validation on 60 conversations. Results reveal statistically significant performance differences under the proposed framework: the top-performing platform, Evalion, achieved an evaluation quality of 0.92 (measured as F1 score) versus 0.73 for the others, and a simulation quality of 0.61 under a league-based scoring system (including ties) versus 0.43 for the other platforms. This framework enables researchers and organizations to empirically validate the testing capabilities of any platform, providing essential measurement foundations for confident voice AI deployment at scale. Supporting materials are made available to facilitate reproducibility and adoption.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_04133
institution	arXiv
publishDate	2025
record_format	arxiv
title	Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms
topic	Artificial Intelligence
url	https://arxiv.org/abs/2511.04133
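
The abstract names pairwise comparisons yielding Elo ratings and bootstrap confidence intervals as building blocks of the framework. The following is a minimal sketch of how those two pieces can fit together, not the authors' implementation: the function names, the K-factor of 32, the base rating of 1000, and the toy judgments are all assumptions made for illustration, and the paper's league-based scoring may be computed differently. Ties are scored as half a win for each side.

```python
import random
from collections import defaultdict


def elo_ratings(judgments, k=32.0, base=1000.0):
    """Sequential Elo update over pairwise judgments.

    judgments: iterable of (a, b, outcome), where outcome is
    1.0 if a was preferred, 0.0 if b was preferred, 0.5 for a tie.
    k and base are illustrative defaults, not values from the paper.
    """
    ratings = defaultdict(lambda: base)
    for a, b, outcome in judgments:
        # Expected score of a under the standard logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        delta = k * (outcome - expected_a)
        ratings[a] += delta  # zero-sum: what a gains, b loses
        ratings[b] -= delta
    return dict(ratings)


def bootstrap_ci(judgments, player, n_boot=200, alpha=0.05, seed=0, base=1000.0):
    """Percentile bootstrap interval for one player's Elo rating.

    Resamples judgments with replacement and recomputes ratings each time.
    A player absent from a resample keeps the base rating.
    """
    rng = random.Random(seed)
    judgments = list(judgments)
    samples = []
    for _ in range(n_boot):
        resample = [rng.choice(judgments) for _ in judgments]
        samples.append(elo_ratings(resample, base=base).get(player, base))
    samples.sort()
    low = samples[int((alpha / 2) * (n_boot - 1))]
    high = samples[int((1 - alpha / 2) * (n_boot - 1))]
    return low, high


# Toy data: hypothetical platforms A, B, C judged pairwise by humans.
toy_judgments = [
    ("A", "B", 1.0),
    ("A", "C", 1.0),
    ("B", "C", 0.5),
    ("A", "B", 0.5),
]
toy_ratings = elo_ratings(toy_judgments)
toy_interval = bootstrap_ci(toy_judgments, "A")
```

Sequential Elo depends on the order in which judgments are processed, so a real analysis would typically shuffle or average over orderings; the bootstrap resampling above has that side effect but does not isolate it.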

Similar Items