MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Cai, Wang; Wen, Yilin; Hou, Jinchang; Su, Du; Wang, Guoqiu; Lv, Zhonghou; Bao, Chenfu; Wu, Yunfang
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.04262

_version_	1866914239318327296
author	Cai, Wang; Wen, Yilin; Hou, Jinchang; Su, Du; Wang, Guoqiu; Lv, Zhonghou; Bao, Chenfu; Wu, Yunfang
author_facet	Cai, Wang; Wen, Yilin; Hou, Jinchang; Su, Du; Wang, Guoqiu; Lv, Zhonghou; Bao, Chenfu; Wu, Yunfang
contents	Safety alignment in Large Language Models (LLMs) inherently presents a multi-objective optimization conflict, often accompanied by an unintended degradation of general capabilities. Existing mitigation strategies typically rely on global gradient geometry to resolve these conflicts, yet they overlook Modular Heterogeneity within Transformers, specifically that the functional sensitivity and degree of conflict vary substantially across different attention heads. Such global approaches impose uniform update rules across all parameters, often resulting in suboptimal trade-offs by indiscriminately updating utility-sensitive heads that exhibit intense gradient conflicts. To address this limitation, we propose Conflict-Aware Sparse Tuning (CAST), a framework that integrates head-level diagnosis with sparse fine-tuning. CAST first constructs a pre-alignment conflict map by synthesizing Optimization Conflict and Functional Sensitivity, which then guides the selective update of parameters. Experiments reveal that alignment conflicts in LLMs are not uniformly distributed. We find that the drop in general capabilities mainly comes from updating a small group of "high-conflict" heads. By simply skipping these heads during training, we significantly reduce this loss without compromising safety, offering an interpretable and parameter-efficient approach to improving the safety-utility trade-off.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_04262
institution	arXiv
publishDate	2026
record_format	arxiv
title	Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2601.04262
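
The abstract describes building a head-level conflict map and then skipping high-conflict heads during fine-tuning. The sketch below illustrates that general idea only; it is not the authors' implementation. The record does not say how Optimization Conflict and Functional Sensitivity are defined or combined, so everything here is an assumption made for illustration: conflict is taken as the negative cosine similarity between a head's safety and utility gradients, sensitivity as the norm of its utility gradient, the score as their product, and `skip_fraction` as a hypothetical parameter. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def conflict_map(safety_grads, utility_grads):
    """Score each head as (gradient opposition) x (utility gradient norm).

    ASSUMPTION: this scoring rule is illustrative; the catalog record
    does not state the formula used by CAST.
    """
    scores = {}
    for head in safety_grads:
        g_s, g_u = safety_grads[head], utility_grads[head]
        conflict = max(0.0, -cosine(g_s, g_u))  # positive only when gradients oppose
        sensitivity = math.sqrt(sum(x * x for x in g_u))  # proxy for sensitivity
        scores[head] = conflict * sensitivity
    return scores


def trainable_heads(scores, skip_fraction=0.25):
    """Return the heads to update, skipping the top `skip_fraction` by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_skip = int(len(ranked) * skip_fraction)
    skipped = set(ranked[:n_skip])
    return [head for head in scores if head not in skipped]


# Toy per-head gradients keyed by (layer, head); the values are made up.
safety = {(0, 0): [1.0, 0.0], (0, 1): [1.0, 0.0],
          (1, 0): [0.0, 1.0], (1, 1): [1.0, 1.0]}
utility = {(0, 0): [1.0, 0.0], (0, 1): [-2.0, 0.0],
           (1, 0): [1.0, 0.0], (1, 1): [-0.1, -0.1]}

scores = conflict_map(safety, utility)
keep = trainable_heads(scores, skip_fraction=0.25)
```

In this toy setup, head `(0, 1)` has directly opposed gradients and the largest utility gradient, so it is the one excluded from `keep`. In a real training loop the resulting list would be turned into a gradient mask over the corresponding attention-head parameters.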