Vista Equipo: :: Library Catalog

Guardado en:

Detalles Bibliográficos
Autores principales:	Nandi, Arghodeep, Saxena, Ojasva, Chakraborty, Tanmoy
Formato:	Preprint
Publicado:	2026
Materias:	Computation and Language
Acceso en línea:	https://arxiv.org/abs/2603.01580
Etiquetas:	Agregar Etiqueta Sin Etiquetas, Sea el primero en etiquetar este registro!

_version_	1866908860149661696
author	Nandi, Arghodeep Saxena, Ojasva Chakraborty, Tanmoy
author_facet	Nandi, Arghodeep Saxena, Ojasva Chakraborty, Tanmoy
contents	Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the fundamental dimensions of an evaluation metric - goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers' D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_01580
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models Nandi, Arghodeep Saxena, Ojasva Chakraborty, Tanmoy Computation and Language Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the fundamental dimensions of an evaluation metric - goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers' D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
title	Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models
topic	Computation and Language
url	https://arxiv.org/abs/2603.01580

Ejemplares similares