Bibliographic Details
Main Authors: Huang, ShiYing, Lin, Liang, Li, Yuer, Luo, Kaiwen, Zhou, Zhenhong, Zhang, An, Dong, Junhao, Wang, Kun, Zeng, Zhigang
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2605.11679
author Huang, ShiYing
Lin, Liang
Li, Yuer
Luo, Kaiwen
Zhou, Zhenhong
Zhang, An
Dong, Junhao
Wang, Kun
Zeng, Zhigang
contents In multi-objective alignment of large language models, balancing disparate human preferences often manifests as a zero-sum conflict: aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). Prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, but these approaches merely force compromises between divergent preferences along a fixed Pareto frontier and do not resolve the underlying trade-off. In this work, we approach the problem from the perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives arises because the prompt itself restricts the achievable multi-dimensional rewards. Based on this observation, we propose MORA: Multi-Objective Reward Assimilation. MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across the helpful, harmless, and truthful dimensions; and (2) in simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our code is available at https://github.com/Shiying-Huang/MORA-MPA.
format Preprint
id arxiv_https___arxiv_org_abs_2605_11679
institution arXiv
publishDate 2026
record_format arxiv
title Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
topic Artificial Intelligence
url https://arxiv.org/abs/2605.11679
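The abstract describes MORA's first step as isolating single-reward prompts through pre-sampling, then rewriting them to carry more intents. The record gives no implementation details, so the following is only a minimal sketch of one way such a filter might look. It assumes each rollout has already been scored on the three dimensions named in the abstract, and that a prompt counts as "single-reward" when at most one dimension varies across its rollouts. The function names, the `min_spread` threshold, and the template-based `rewrite_prompt` are all hypothetical and are not taken from the MORA repository; a real rewriting step would presumably use a language model.

```python
from statistics import pstdev

# Dimension names follow the abstract; everything else here is an assumption.
DIMENSIONS = ("helpful", "harmless", "truthful")


def active_dimensions(rollout_rewards, min_spread=0.1):
    """Return the reward dimensions whose scores vary across rollouts.

    rollout_rewards: list of dicts, one per sampled response to a prompt,
    mapping each dimension name to a numeric reward.
    min_spread: hypothetical threshold on the population standard deviation.
    """
    active = []
    for dim in DIMENSIONS:
        scores = [reward[dim] for reward in rollout_rewards]
        if pstdev(scores) >= min_spread:
            active.append(dim)
    return active


def is_single_reward_prompt(rollout_rewards, min_spread=0.1):
    """Treat a prompt as single-reward if at most one dimension is active."""
    return len(active_dimensions(rollout_rewards, min_spread)) <= 1


def rewrite_prompt(prompt, rollout_rewards, min_spread=0.1):
    """Stand-in for the rewriting step: append the inactive dimensions as
    extra intents. This template is illustrative only."""
    active = active_dimensions(rollout_rewards, min_spread)
    missing = [dim for dim in DIMENSIONS if dim not in active]
    if not missing:
        return prompt
    return f"{prompt} Please also make the answer {' and '.join(missing)}."


# Example: only helpfulness varies across the two rollouts.
rollouts = [
    {"helpful": 0.2, "harmless": 0.9, "truthful": 0.5},
    {"helpful": 0.8, "harmless": 0.9, "truthful": 0.5},
]
print(active_dimensions(rollouts))
print(is_single_reward_prompt(rollouts))
print(rewrite_prompt("How do I pick a strong password?", rollouts))
```

Using the spread of rewards across rollouts is one possible proxy for "the prompt restricts the achievable rewards"; other criteria, such as the range or a ceiling on the best rollout per dimension, would fit the abstract's wording equally well.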