Staff View: :: Library Catalog


Bibliographic Details
Main Authors:	Lee, Seanie; Park, Sangwoo; Choi, Yumin; Kim, Gyeongman; Kang, Minki; Yun, Jihun; Park, Dongmin; Park, Jongho; Hwang, Sung Ju
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.23143

_version_	1866914562255618048
author	Lee, Seanie; Park, Sangwoo; Choi, Yumin; Kim, Gyeongman; Kang, Minki; Yun, Jihun; Park, Dongmin; Park, Jongho; Hwang, Sung Ju
author_facet	Lee, Seanie; Park, Sangwoo; Choi, Yumin; Kim, Gyeongman; Kang, Minki; Yun, Jihun; Park, Dongmin; Park, Jongho; Hwang, Sung Ju
contents	Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We formalize safety realignment as a KL projection onto the safe simplex and prove that the student's own safety-filtered distribution is the unique KL-optimal target, while any external teacher incurs an irreducible excess KL penalty. Guided by this analysis, we propose ThinkSafe, a self-generated alignment framework that restores safety without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, which preserves the KL-optimal target while increasing the acceptance rate. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency, and achieves superior safety and comparable reasoning to GRPO with roughly an order of magnitude less compute. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe and https://huggingface.co/Seanie-lee/collections.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_23143
institution	arXiv
publishDate	2026
record_format	arxiv
title	THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
topic	Artificial Intelligence
url	https://arxiv.org/abs/2601.23143
