Vista Equipo: :: Library Catalog

Guardado en:

Detalles Bibliográficos
Autores principales:	Qi, Han, Yang, Haochen, Zhang, Qiaosheng, Yang, Zhuoran
Formato:	Preprint
Publicado:	2025
Materias:	Machine Learning
Acceso en línea:	https://arxiv.org/abs/2502.05434
Etiquetas:	Agregar Etiqueta Sin Etiquetas, Sea el primero en etiquetar este registro!

_version_	1866916885846556672
author	Qi, Han Yang, Haochen Zhang, Qiaosheng Yang, Zhuoran
author_facet	Qi, Han Yang, Haochen Zhang, Qiaosheng Yang, Zhuoran
contents	We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models, from a theoretical perspective. Our main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS), an online decision-making principle inspired by information theory. Our algorithms maximize the sum of the value function and a mutual information term that encourages exploration of the unknown environment (which quantifies the information gained about the environment through observed human feedback data). To tackle the challenge of large state spaces and improve sample efficiency, we construct a simplified \emph{surrogate environment} and introduce a novel distance measure (named the \emph{$\ell_g$-distance}), enabling our IDS-based algorithm to achieve a Bayesian regret upper bound of order $O(H^{\frac{3}{2}}\sqrt{\log(K(ε)) T})$, where $H$ is the episode length, $T$ is the number of episode and $K(ε)$ is related to the covering number of the environment. Specializing to the tabular settings, this regret bound is of order $\tilde{O}(H^2\sqrt{SAT})$, where $S$ and $A$ are the numbers of states and actions. Finally, we propose an Approximate-IDS algorithm that is computationally more efficient while maintaining nearly the same sample efficiency. The design principle of this approximate algorithm is not only effective in RLHF settings but also applicable to the standard RL framework. Moreover, our work showcases the value of information theory in reinforcement learning and in the training of large language models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_05434
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling Qi, Han Yang, Haochen Zhang, Qiaosheng Yang, Zhuoran Machine Learning We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models, from a theoretical perspective. Our main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS), an online decision-making principle inspired by information theory. Our algorithms maximize the sum of the value function and a mutual information term that encourages exploration of the unknown environment (which quantifies the information gained about the environment through observed human feedback data). To tackle the challenge of large state spaces and improve sample efficiency, we construct a simplified \emph{surrogate environment} and introduce a novel distance measure (named the \emph{$\ell_g$-distance}), enabling our IDS-based algorithm to achieve a Bayesian regret upper bound of order $O(H^{\frac{3}{2}}\sqrt{\log(K(ε)) T})$, where $H$ is the episode length, $T$ is the number of episode and $K(ε)$ is related to the covering number of the environment. Specializing to the tabular settings, this regret bound is of order $\tilde{O}(H^2\sqrt{SAT})$, where $S$ and $A$ are the numbers of states and actions. Finally, we propose an Approximate-IDS algorithm that is computationally more efficient while maintaining nearly the same sample efficiency. The design principle of this approximate algorithm is not only effective in RLHF settings but also applicable to the standard RL framework. Moreover, our work showcases the value of information theory in reinforcement learning and in the training of large language models.
title	Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling
topic	Machine Learning
url	https://arxiv.org/abs/2502.05434

Ejemplares similares