Bibliographic Details
Main Authors: Aguirre, Nicolás; Caso, Ramiro; Colmeiro, Ramiro Rodríguez; Santelli, Mauro; Calderón, Joaquín Toranzo
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2510.01469
author Aguirre, Nicolás
Caso, Ramiro
Colmeiro, Ramiro Rodríguez
Santelli, Mauro
Calderón, Joaquín Toranzo
contents The automatic evaluation of Language Model (LM) responses is a critical component in the development of benchmarks and metrics, both for model training and for quality assessment of production model endpoints. Current approaches to response classification rely on methods that are either too expensive (e.g. LLM-as-a-Judge) or far from real-world conditions (string-matching, logprob). In this paper, a structure-free evaluation method is presented. The method uses semantic embedding distances to match target candidates with arbitrary LM-generated text, resulting in a robust classification of the response at a relatively low compute cost (embedding models of fewer than $10B$ parameters). The results show a regression score of ~0.97 and an accuracy of ~96% against human annotators, tested over 3 data sets and 3 different LM architectures.
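The idea in the abstract, matching free-form LM output against target candidates by embedding distance, can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `classify_response`, its parameters, and the decision rule (take the single top-ranked target) are assumptions made only for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_response(response, correct_targets, wrong_targets, embed=toy_embed):
    """Rank all target candidates by similarity to the response.

    Returns (is_correct, best_target, best_score), where is_correct says
    whether the top-ranked candidate belongs to the correct group.
    """
    r = embed(response)
    scored = [(cosine(r, embed(t)), t, True) for t in correct_targets]
    scored += [(cosine(r, embed(t)), t, False) for t in wrong_targets]
    # Highest similarity first; ties keep their original order.
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    best_score, best_target, is_correct = ranked[0]
    return is_correct, best_target, best_score


if __name__ == "__main__":
    result = classify_response(
        "I think the capital of France is Paris.",
        correct_targets=["Paris"],
        wrong_targets=["London", "Berlin"],
    )
    print(result)
```

In practice the `embed` argument would be replaced by a call to a real sentence-embedding model; the ranking step stays the same because it depends only on pairwise similarities, not on the structure of the response.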
format Preprint
id arxiv_https___arxiv_org_abs_2510_01469
institution arXiv
publishDate 2025
record_format arxiv
title A-VERT: Agnostic Verification with Embedding Ranking Targets
topic Computation and Language
Machine Learning
68T50
I.2.7
url https://arxiv.org/abs/2510.01469