Bibliographic Details
Main Authors: Peng, ShengYun, Smith, Eric, Evtimov, Ivan, Jiang, Song, Chen, Pin-Yu, Zhan, Hongyuan, Wang, Haozhu, Chau, Duen Horng, Pasupuleti, Mahesh, Chi, Jianfeng
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2510.00938
author Peng, ShengYun
Smith, Eric
Evtimov, Ivan
Jiang, Song
Chen, Pin-Yu
Zhan, Hongyuan
Wang, Haozhu
Chau, Duen Horng
Pasupuleti, Mahesh
Chi, Jianfeng
contents Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.
format Preprint
id arxiv_https___arxiv_org_abs_2510_00938
institution arXiv
publishDate 2025
record_format arxiv
title Large Reasoning Models Learn Better Alignment from Flawed Thinking
topic Machine Learning
url https://arxiv.org/abs/2510.00938