Bibliographic Details
Main Authors: Wang, Haoyu; Qin, Zeyu; Shen, Li; Wang, Xueqian; Tao, Dacheng; Cheng, Minhao
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2502.04040
author Wang, Haoyu
Qin, Zeyu
Shen, Li
Wang, Xueqian
Tao, Dacheng
Cheng, Minhao
contents Training safe LLMs remains a critical challenge. The most widely used method, Refusal Training (RT), struggles to generalize against various Out-of-Distribution (OOD) jailbreaking attacks. Although various advanced methods have been proposed to address this issue, we instead question whether OOD attacks inherently surpass the capability of vanilla RT. Evaluations using Best-of-N (BoN) reveal significant safety improvements as N increases, indicating that models possess adequate latent safety knowledge but that RT fails to consistently elicit it under OOD scenarios. Further domain adaptation analysis reveals that direct RT causes reliance on superficial shortcuts, resulting in non-generalizable representation mappings. Inspired by our findings, we propose training the model to perform safety reasoning for each query. Specifically, we synthesize reasoning supervision aligned with specified guidelines that reflect diverse perspectives on safety knowledge. This encourages the model to engage in deeper reasoning, explicitly eliciting and utilizing latent safety knowledge for each query. Extensive experiments show that our method significantly improves model generalization against OOD attacks.
format Preprint
id arxiv_https___arxiv_org_abs_2502_04040
institution arXiv
publishDate 2025
record_format arxiv
title Safety Reasoning with Guidelines
topic Machine Learning
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2502.04040