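| Main authors: | Lee, Seanie; Park, Sangwoo; Choi, Yumin; Kim, Gyeongman; Kang, Minki; Yun, Jihun; Park, Dongmin; Park, Jongho; Hwang, Sung Ju |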
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Artificial Intelligence |
| Online access: | https://arxiv.org/abs/2601.23143 |
| _version_ | 1866914562255618048 |
|---|---|
| author | Lee, Seanie; Park, Sangwoo; Choi, Yumin; Kim, Gyeongman; Kang, Minki; Yun, Jihun; Park, Dongmin; Park, Jongho; Hwang, Sung Ju |
| author_facet | Lee, Seanie; Park, Sangwoo; Choi, Yumin; Kim, Gyeongman; Kang, Minki; Yun, Jihun; Park, Dongmin; Park, Jongho; Hwang, Sung Ju |
| contents | Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning.
However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on
external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We formalize safety realignment as a KL projection onto the safe
simplex and prove that the student's own safety-filtered distribution is the unique KL-optimal target, while any external teacher incurs an irreducible excess KL penalty.
Guided by this analysis, we propose ThinkSafe, a self-generated alignment framework that restores safety without external teachers. Our key insight is that while compliance
suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, which preserves the KL-optimal
target while increasing the acceptance rate. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency,
and achieves superior safety and comparable reasoning to GRPO with roughly an order of magnitude less compute. Code, models, and datasets are available at
https://github.com/seanie12/ThinkSafe and https://huggingface.co/Seanie-lee/collections. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2601_23143 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | THINKSAFE: Self-Generated Safety Alignment for Reasoning Models |
| topic | Artificial Intelligence |
| url | https://arxiv.org/abs/2601.23143 |
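
The abstract's claim that the student's own safety-filtered distribution is the KL-optimal target can be illustrated on a toy discrete example. The sketch below is not the authors' code. It assumes the projection minimizes KL(q || p) over distributions q supported only on safe outputs (the abstract does not state the direction), and the response names and probabilities are hypothetical. Under that assumption, the filtered and renormalized student distribution attains KL equal to minus the log of the safe mass, and any other safe distribution pays that amount plus its own KL to the filtered student.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions stored as dicts (0 * log 0 treated as 0)."""
    return sum(qy * math.log(qy / p[y]) for y, qy in q.items() if qy > 0)

def safety_filtered(p, safe):
    """Restrict p to the safe outputs and renormalize."""
    mass = sum(p[y] for y in safe)
    return {y: p[y] / mass for y in safe}

# Hypothetical student distribution over four candidate responses.
student = {"comply_a": 0.5, "comply_b": 0.2, "refuse_a": 0.2, "refuse_b": 0.1}
safe = ["refuse_a", "refuse_b"]

# Student's own safety-filtered distribution (the assumed projection target).
target = safety_filtered(student, safe)

# A hypothetical external teacher that is also safe but distributed differently.
teacher = {"refuse_a": 0.1, "refuse_b": 0.9}

safe_mass = sum(student[y] for y in safe)
kl_target = kl(target, student)    # equals -log(safe_mass)
kl_teacher = kl(teacher, student)  # equals kl_target + excess
excess = kl(teacher, target)       # extra penalty paid by the teacher

print(f"KL(target || student)  = {kl_target:.4f}")
print(f"KL(teacher || student) = {kl_teacher:.4f}")
print(f"excess KL(teacher || target) = {excess:.4f}")
```

The decomposition `kl_teacher = kl_target + excess` holds for any safe distribution in place of `teacher`, which is why the excess term cannot be removed without matching the filtered student exactly.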