Staff View: Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion :: Library Catalog

Bibliographic Details
Main Authors:	Huang, ShiYing; Lin, Liang; Li, Yuer; Luo, Kaiwen; Zhou, Zhenhong; Zhang, An; Dong, Junhao; Wang, Kun; Zeng, Zhigang
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.11679

_version_	1866909039032532992
author	Huang, ShiYing; Lin, Liang; Li, Yuer; Luo, Kaiwen; Zhou, Zhenhong; Zhang, An; Dong, Junhao; Wang, Kun; Zeng, Zhigang
author_facet	Huang, ShiYing; Lin, Liang; Li, Yuer; Luo, Kaiwen; Zhou, Zhenhong; Zhang, An; Dong, Junhao; Wang, Kun; Zeng, Zhigang
contents	In multi-objective alignment of large language models, balancing disparate human preferences often manifests as a zero-sum conflict: aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). Prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, but these approaches merely force compromises between divergent preferences along a fixed Pareto frontier and do not resolve the underlying trade-off. In this work, we approach the problem from the perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we find that the conflict among multiple objectives stems from the prompt itself, which inherently restricts the achievable multi-dimensional rewards. Based on this observation, we propose MORA: Multi-Objective Reward Assimilation. MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multi-preference alignment across the helpful, harmless, and truthful dimensions; and (2) in simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our code is available at https://github.com/Shiying-Huang/MORA-MPA.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_11679
institution	arXiv
publishDate	2026
record_format	arxiv
title	Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.11679

Similar Items