Bibliographic Details
Main Authors:	Nguyen, Tien-Phat; Nguyen, Truong; Nguyen, Thin; Nguyen, Duy Minh Ho; Dinh, Ngoc-Thanh; Le, Trung
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.12339

Abstract:

Aligning language models for both helpfulness and safety typically requires complex pipelines: separate reward and cost models, online reinforcement learning, and primal-dual updates. Recent direct preference optimization approaches simplify training but incorporate safety through ad hoc modifications, such as multi-stage procedures or heuristic margin terms, that lack a principled derivation. We show that the likelihood ratio of the optimal safe policy admits a closed-form decomposition that reduces safety alignment to a density ratio matching problem. Minimizing Bregman divergences between the data and model ratios yields Bregman Safety Optimization (BSO), a family of single-stage loss functions, each induced by a convex generator, that provably recover the optimal safe policy. BSO is both general and simple: it requires no auxiliary models, introduces only one hyperparameter beyond standard preference optimization, and recovers existing safety-aware methods as special cases. Experiments across safety alignment benchmarks show that BSO consistently improves the safety-helpfulness trade-off.
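
As a rough illustration of what "Bregman divergence induced by a convex generator" means in density ratio matching, the sketch below evaluates the generic pointwise divergence D_f(r || r_hat) = f(r) - f(r_hat) - f'(r_hat)(r - r_hat) for two textbook generators. It is not the BSO loss from the paper: the record does not state that loss, and the function names, generators, and ratio values here are assumptions chosen only for illustration.

```python
import math


def bregman_divergence(f, f_prime, r, r_hat):
    """Pointwise Bregman divergence D_f(r || r_hat) for a convex generator f."""
    return f(r) - f(r_hat) - f_prime(r_hat) * (r - r_hat)


# Two standard convex generators (illustrative choices, not taken from the paper).
def f_ls(t):
    # Least-squares generator: the divergence reduces to 0.5 * (r - r_hat)^2.
    return 0.5 * (t - 1.0) ** 2


def f_ls_prime(t):
    return t - 1.0


def f_kl(t):
    # KL-type generator: the divergence reduces to r*log(r/r_hat) - r + r_hat.
    return t * math.log(t) - t


def f_kl_prime(t):
    return math.log(t)


if __name__ == "__main__":
    data_ratio = 2.0  # hypothetical "true" density ratio at one point
    for model_ratio in (0.5, 1.0, 2.0, 4.0):
        d_ls = bregman_divergence(f_ls, f_ls_prime, data_ratio, model_ratio)
        d_kl = bregman_divergence(f_kl, f_kl_prime, data_ratio, model_ratio)
        print(f"model ratio {model_ratio}: LS = {d_ls:.4f}, KL = {d_kl:.4f}")
```

Under both generators the divergence is nonnegative and equals zero exactly when the model ratio matches the data ratio, which is the property that lets each convex generator define its own matching loss.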
