Saved in:
| Main Authors: | , , , , , , , , |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.11679 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866909039032532992 |
|---|---|
| author | Huang, ShiYing Lin, Liang Li, Yuer Luo, Kaiwen Zhou, Zhenhong Zhang, An Dong, Junhao Wang, Kun Zeng, Zhigang |
| author_facet | Huang, ShiYing Lin, Liang Li, Yuer Luo, Kaiwen Zhou, Zhenhong Zhang, An Dong, Junhao Wang, Kun Zeng, Zhigang |
| contents | In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at https://github.com/Shiying-Huang/MORA-MPA. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2605_11679 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| spellingShingle | Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion Huang, ShiYing Lin, Liang Li, Yuer Luo, Kaiwen Zhou, Zhenhong Zhang, An Dong, Junhao Wang, Kun Zeng, Zhigang Artificial Intelligence In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at https://github.com/Shiying-Huang/MORA-MPA. |
| title | Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion |
| topic | Artificial Intelligence |
| url | https://arxiv.org/abs/2605.11679 |