MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Shen, Yiyang; Tu, Lifu; Wang, Weiran
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2604.02621

_version_	1866913001100017664
author	Shen, Yiyang; Tu, Lifu; Wang, Weiran
author_facet	Shen, Yiyang; Tu, Lifu; Wang, Weiran
contents	Reinforcement learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards and hence on ground-truth labels. We propose an RL framework that uses rewards from an LLM acting as a judge that evaluates model outputs over large amounts of unlabeled data, enabling label-free knowledge distillation and removing the need for ground-truth supervision. Notably, the judge operates with a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_02621
institution	arXiv
publishDate	2026
record_format	arxiv
title	Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2604.02621
