Bibliographic Details
Main Authors:	Jiang, Enyi; Zhang, Yibo Jacky; Xu, Yinglun; Haupt, Andreas; Amato, Nancy; Koyejo, Sanmi
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.22083

author	Jiang, Enyi; Zhang, Yibo Jacky; Xu, Yinglun; Haupt, Andreas; Amato, Nancy; Koyejo, Sanmi
contents	Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.
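The abstract describes adding a latent-space adversarial term to an existing offline preference objective. The following is a minimal illustrative sketch under stated assumptions, not the authors' GANPO implementation: it assumes DPO as the base objective, a hypothetical linear discriminator over a single pooled hidden-state vector, and the standard non-saturating GAN generator loss. All function names, `beta`, and the regularizer weight `lam` are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)) for both signs of x.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    # ASSUMPTION: DPO is the base offline preference objective.
    # Inputs are sequence log-probabilities for one preference pair.
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return -log_sigmoid(margin)


def discriminator_logit(w, b, h):
    # HYPOTHETICAL linear discriminator D(h) = w . h + b on a pooled hidden state h.
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def discriminator_loss(w, b, h_ref, h_pol):
    # Discriminator objective: label reference-model latents "real" (1)
    # and policy-model latents "fake" (0).
    return (-log_sigmoid(discriminator_logit(w, b, h_ref))
            - log_sigmoid(-discriminator_logit(w, b, h_pol)))


def adversarial_regularizer(w, b, h_pol):
    # Non-saturating generator loss: small when the discriminator
    # judges the policy latent to look like a reference latent.
    return -log_sigmoid(discriminator_logit(w, b, h_pol))


def regularized_loss(pair_logps, h_pol, w, b, beta=0.1, lam=0.5):
    # Preference loss plus a weighted latent-space adversarial term
    # (lam is a HYPOTHETICAL weight, not a value from the paper).
    return dpo_loss(*pair_logps, beta=beta) + lam * adversarial_regularizer(w, b, h_pol)


if __name__ == "__main__":
    pair = (-1.0, -3.0, -1.5, -2.0)  # toy log-probs: policy chosen/rejected, reference chosen/rejected
    h_pol, h_ref = [0.2, -0.1], [0.3, 0.0]  # toy pooled latents
    w, b = [0.5, -0.5], 0.0  # toy discriminator parameters
    print(regularized_loss(pair, h_pol, w, b))
    print(discriminator_loss(w, b, h_ref, h_pol))
```

In a training loop built on this sketch, the discriminator would be updated to minimize `discriminator_loss` while the policy minimizes `regularized_loss`, alternating as in GAN training. A practical version would operate on batched tensors and backpropagate through the model's hidden states rather than using plain Python floats.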
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_22083
institution	arXiv
publishDate	2026
record_format	arxiv
title	Latent Adversarial Regularization for Offline Preference Optimization
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2601.22083