Bibliographic Details
Main authors: Li, Zhaochun; Yi, Mingyang; Wang, Yue; Cui, Shisheng; Liu, Yong
Format: Preprint
Published: 2026
Subjects:
Online access: https://arxiv.org/abs/2601.16403
_version_ 1866911393458946048
author Li, Zhaochun
Yi, Mingyang
Wang, Yue
Cui, Shisheng
Liu, Yong
author_facet Li, Zhaochun
Yi, Mingyang
Wang, Yue
Cui, Shisheng
Liu, Yong
contents Reinforcement Learning from Human Feedback (RLHF) and its variants have emerged as the dominant approaches for aligning Large Language Models (LLMs) with human intent. While empirically effective, the theoretical generalization properties of these methods in high-dimensional settings remain to be explored. To this end, we build a generalization theory for RLHF of LLMs under the linear reward model, through the framework of algorithmic stability. In contrast to existing works built upon the consistency of maximum likelihood estimation of the reward model, our analysis is presented under an end-to-end learning framework, which is consistent with practice. Concretely, we prove that under a key feature coverage condition, the empirical optima of the policy model have a generalization bound of order $\mathcal{O}(n^{-\frac{1}{2}})$. Moreover, the results can be extended to parameters obtained by gradient-based learning algorithms, i.e., Gradient Ascent (GA) and Stochastic Gradient Ascent (SGA). Thus, we argue that our results provide new theoretical evidence for the empirically observed generalization of LLMs after RLHF.
format Preprint
id arxiv_https___arxiv_org_abs_2601_16403
institution arXiv
publishDate 2026
record_format arxiv
title Towards a Theoretical Understanding to the Generalization of RLHF
topic Machine Learning
url https://arxiv.org/abs/2601.16403