| Main Author: | Cristofano, Tony |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.08489 |
Similar Items
Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction
by: Cristofano, Tony
Published: (2026)
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
by: Wang, Xinpeng, et al.
Published: (2024)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
by: Yuan, Youliang, et al.
Published: (2024)
Think Before Refusal : Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior
by: Si, Shengyun, et al.
Published: (2025)
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
by: Lermen, Simon, et al.
Published: (2024)
Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint
by: Du, Yanrui, et al.
Published: (2025)
Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks
by: Ham, Seokil, et al.
Published: (2025)
Refusal Direction is Universal Across Safety-Aligned Languages
by: Wang, Xinpeng, et al.
Published: (2025)
Should LLM Safety Be More Than Refusing Harmful Instructions?
by: Maskey, Utsav, et al.
Published: (2025)
RepIt: Steering Language Models with Concept-Specific Refusal Vectors
by: Siu, Vincent, et al.
Published: (2025)
DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation
by: Jiang, Houcheng, et al.
Published: (2025)
Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety
by: Zhang, Yuyou, et al.
Published: (2025)
When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals
by: Anonto, Riad Ahmed, et al.
Published: (2025)
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
by: Wollschläger, Tom, et al.
Published: (2025)
From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training
by: Yuan, Yuan, et al.
Published: (2025)
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
by: Han, Seungju, et al.
Published: (2024)
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning
by: Zhang, Zhehao, et al.
Published: (2025)
RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs
by: Asif, Sadia, et al.
Published: (2026)
Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models
by: Duan, Ranjie, et al.
Published: (2025)
Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries
by: Gondil, Tanay
Published: (2026)
Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
by: Maskey, Utsav, et al.
Published: (2026)
Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs
by: Yuan, Shuzhou, et al.
Published: (2025)
When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents
by: Hadeliya, Tsimur, et al.
Published: (2025)
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
by: Chae, Kyubyung, et al.
Published: (2025)
ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts
by: Aswal, Darpan, et al.
Published: (2025)
Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
by: Jain, Neel, et al.
Published: (2024)
ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal
by: Zhang, Haonan, et al.
Published: (2025)
Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI
by: Kapelko, Eduard
Published: (2025)
LLMs Encode Harmfulness and Refusal Separately
by: Zhao, Jiachen, et al.
Published: (2025)
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
by: García-Ferrero, Iker, et al.
Published: (2025)
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
by: Casademunt, Helena, et al.
Published: (2025)
RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models
by: Muhamed, Aashiq, et al.
Published: (2025)
Refusal in LLMs is an Affine Function
by: Marshall, Thomas, et al.
Published: (2024)
Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
by: Pan, Wenbo, et al.
Published: (2025)
Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement
by: Lu, Yuxiao, et al.
Published: (2026)
Don't Say No: Jailbreaking LLM by Suppressing Refusal
by: Zhou, Yukai, et al.
Published: (2024)
Understanding Refusal in Language Models with Sparse Autoencoders
by: Yeo, Wei Jie, et al.
Published: (2025)
Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability
by: Chhabra, Vishnu Kabir, et al.
Published: (2025)
Latent Concept Disentanglement in Transformer-based Language Models
by: Hong, Guan Zhe, et al.
Published: (2025)
Characterizing Selective Refusal Bias in Large Language Models
by: Khorramrouz, Adel, et al.
Published: (2025)