Guardado en:
Detalles Bibliográficos
Autores principales: Shi, Ruizhe, Zhou, Runlong, Du, Simon S.
Formato: Preprint
Publicado: 2024
Materias:
Acceso en línea:https://arxiv.org/abs/2409.19605
Etiquetas: Agregar Etiqueta
Sin Etiquetas, Sea el primero en etiquetar este registro!
Tabla de Contenidos:
  • Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, the optimization properties, particularly the impact of samplers on its convergence rates, remain under-explored. In this paper, we provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves $\textbf{linear}$ convergence, while our proposed online sampler achieves $\textbf{quadratic}$ convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating improvements over previous methods. For example, it outperforms vanilla DPO by over $7.4$% on Safe-RLHF dataset. Our results not only offer insights into the theoretical understanding of DPO but also pave the way for further algorithm designs.