Bibliographic Details
Main Authors: Lee, Seanie, Park, Sangwoo, Choi, Yumin, Kim, Gyeongman, Kang, Minki, Yun, Jihun, Park, Dongmin, Park, Jongho, Hwang, Sung Ju
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2601.23143
_version_ 1866914562255618048
author Lee, Seanie
Park, Sangwoo
Choi, Yumin
Kim, Gyeongman
Kang, Minki
Yun, Jihun
Park, Dongmin
Park, Jongho
Hwang, Sung Ju
contents Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We formalize safety realignment as a KL projection onto the safe simplex and prove that the student's own safety-filtered distribution is the unique KL-optimal target, while any external teacher incurs an irreducible excess KL penalty. Guided by this analysis, we propose ThinkSafe, a self-generated alignment framework that restores safety without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, which preserves the KL-optimal target while increasing the acceptance rate. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency, and achieves superior safety and comparable reasoning to GRPO with roughly an order of magnitude less compute. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe and https://huggingface.co/Seanie-lee/collections.
format Preprint
id arxiv_https___arxiv_org_abs_2601_23143
institution arXiv
publishDate 2026
record_format arxiv
title THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
topic Artificial Intelligence
url https://arxiv.org/abs/2601.23143