Bibliographic Details
Main Authors: Shen, Yiyang; Tu, Lifu; Wang, Weiran
Format: Preprint
Published: 2026
Subjects: Computation and Language; Machine Learning
Online Access: https://arxiv.org/abs/2604.02621
_version_ 1866913001100017664
author Shen, Yiyang
Tu, Lifu
Wang, Weiran
author_facet Shen, Yiyang
Tu, Lifu
Wang, Weiran
contents Reinforcement learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards and hence on ground-truth labels. We propose an RL framework that uses rewards from an LLM acting as a judge that evaluates model outputs over large amounts of unlabeled data, enabling label-free knowledge distillation and removing the need for ground-truth supervision. Notably, the judge produces a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.
format Preprint
id arxiv_https___arxiv_org_abs_2604_02621
institution arXiv
publishDate 2026
record_format arxiv
title Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2604.02621