Bibliographic Details
Main Authors: Chandra, Arjun; Miller, Kevin; Ravichandran, Venkatesh; Papayiannis, Constantinos; Saligrama, Venkatesh
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2601.13742
author Chandra, Arjun
Miller, Kevin
Ravichandran, Venkatesh
Papayiannis, Constantinos
Saligrama, Venkatesh
contents Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models (ALMs). In this work, we propose TRACE (Textual Reasoning over Audio Cues for Evaluation), a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. To demonstrate the strength of the framework, we first introduce a Human Chain-of-Thought (HCoT) annotation protocol to improve the diagnostic capability of existing judge benchmarks by separating evaluation into explicit dimensions: content (C), voice quality (VQ), and paralinguistics (P). Using this data, TRACE constructs a textual blueprint of inexpensive audio signals and prompts an LLM to render dimension-wise judgments, fusing them into an overall rating via a deterministic policy. TRACE achieves higher agreement with human raters than ALMs and transcript-only LLM judges while being significantly more cost-effective. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.
format Preprint
id arxiv_https___arxiv_org_abs_2601_13742
institution arXiv
publishDate 2026
record_format arxiv
title Hearing Between the Lines: Unlocking the Reasoning Power of LLMs for Speech Evaluation
topic Computation and Language
url https://arxiv.org/abs/2601.13742