Tabla de Contenidos: :: Library Catalog

Guardado en:

Detalles Bibliográficos
Autores principales:	Shi, Ruizhe, Zhou, Runlong, Du, Simon S.
Formato:	Preprint
Publicado:	2024
Materias:	Machine Learning Computation and Language
Acceso en línea:	https://arxiv.org/abs/2409.19605
Etiquetas:	Agregar Etiqueta Sin Etiquetas, Sea el primero en etiquetar este registro!

Tabla de Contenidos:

Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, the optimization properties, particularly the impact of samplers on its convergence rates, remain under-explored. In this paper, we provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves $\textbf{linear}$ convergence, while our proposed online sampler achieves $\textbf{quadratic}$ convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating improvements over previous methods. For example, it outperforms vanilla DPO by over $7.4$% on Safe-RLHF dataset. Our results not only offer insights into the theoretical understanding of DPO but also pave the way for further algorithm designs.

Ejemplares similares