Staff View: Constrained Language Model Policy Optimization via Risk-aware Stepwise Alignment :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Lijun; Li, Lin; Wei, Wei; Qi, Yajie; Song, Huizhong; Wang, Jun; Yang, Yaodong; Liang, Jiye
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.24263

_version_	1866911346973474816
author	Zhang, Lijun; Li, Lin; Wei, Wei; Qi, Yajie; Song, Huizhong; Wang, Jun; Yang, Yaodong; Liang, Jiye
author_facet	Zhang, Lijun; Li, Lin; Wei, Wei; Qi, Yajie; Song, Huizhong; Wang, Jun; Yang, Yaodong; Liang, Jiye
contents	When fine-tuning pre-trained Language Models (LMs) to exhibit desired behaviors, maintaining control over risk is critical for ensuring both safety and trustworthiness. Most existing safety alignment methods, such as Safe RLHF and SACPO, typically operate under a risk-neutral paradigm that is insufficient to address the risks arising from deviations from the reference policy and offers limited robustness against rare but potentially catastrophic harmful behaviors. To address this limitation, we propose Risk-aware Stepwise Alignment (RSA), a novel alignment method that explicitly incorporates risk awareness into the policy optimization process by leveraging a class of nested risk measures. Specifically, RSA formulates safety alignment as a token-level risk-aware constrained policy optimization problem and solves it through a stepwise alignment procedure that yields token-level policy updates derived from the nested risk measures. This design offers two key benefits: (1) it mitigates risks induced by excessive model shift away from a reference policy, and (2) it explicitly suppresses low-probability yet high-impact harmful behaviors. Moreover, we provide theoretical analysis on policy optimality under mild assumptions. Experimental results demonstrate that our method achieves high levels of helpfulness while ensuring strong safety and significantly suppresses tail risks, namely low-probability yet high-impact unsafe responses.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_24263
institution	arXiv
publishDate	2025
record_format	arxiv
title	Constrained Language Model Policy Optimization via Risk-aware Stepwise Alignment
topic	Artificial Intelligence
url	https://arxiv.org/abs/2512.24263
