| Main Author: | Cristofano, Tony |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.16034 |
Similar Items
Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning
by: Cristofano, Tony
Published: (2026)
Refusal Direction is Universal Across Safety-Aligned Languages
by: Wang, Xinpeng, et al.
Published: (2025)
Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment
by: Li, Jie, et al.
Published: (2024)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
by: Yuan, Youliang, et al.
Published: (2024)
RepIt: Steering Language Models with Concept-Specific Refusal Vectors
by: Siu, Vincent, et al.
Published: (2025)
LLMs Encode Harmfulness and Refusal Separately
by: Zhao, Jiachen, et al.
Published: (2025)
Refusal in LLMs is an Affine Function
by: Marshall, Thomas, et al.
Published: (2024)
Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer
by: Shao, Shun, et al.
Published: (2026)
Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
by: Maskey, Utsav, et al.
Published: (2026)
Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs
by: Yuan, Shuzhou, et al.
Published: (2025)
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
by: Xu, Yiheng, et al.
Published: (2024)
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
by: Hu, Xulin, et al.
Published: (2026)
Think Before Refusal : Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior
by: Si, Shengyun, et al.
Published: (2025)
Benchmarking Concept-Spilling Across Languages in LLMs
by: Badanin, Ilia, et al.
Published: (2026)
$C$-$ΔΘ$: Circuit-Restricted Weight Arithmetic for Selective Refusal
by: Kasliwal, Aditya, et al.
Published: (2026)
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
by: Wollschläger, Tom, et al.
Published: (2025)
RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
by: Nguyen, Tuan T., et al.
Published: (2025)
Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
by: Pan, Wenbo, et al.
Published: (2025)
Language Model Circuits Are Sparse in the Neuron Basis
by: Arora, Aryaman, et al.
Published: (2026)
Silenced Biases: The Dark Side LLMs Learned to Refuse
by: Himelstein, Rom, et al.
Published: (2025)
Learning to Refuse: Towards Mitigating Privacy Risks in LLMs
by: Liu, Zhenhua, et al.
Published: (2024)
Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
by: von Recum, Alexander, et al.
Published: (2024)
Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
by: Jain, Neel, et al.
Published: (2024)
Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models
by: Yoon, Eunseop, et al.
Published: (2025)
Do Reasoning LLMs Refuse What They Infer in Long Contexts?
by: Fu, Yu, et al.
Published: (2026)
Does Refusal Training in LLMs Generalize to the Past Tense?
by: Andriushchenko, Maksym, et al.
Published: (2024)
AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
by: Ding, Liang
Published: (2026)
Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
by: Song, Maojia, et al.
Published: (2024)
Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts
by: Huang, Youcheng, et al.
Published: (2025)
RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models
by: Muhamed, Aashiq, et al.
Published: (2025)
Contrastive Cross-Course Knowledge Tracing via Concept Graph Guided Knowledge Transfer
by: Han, Wenkang, et al.
Published: (2025)
Understanding Refusal in Language Models with Sparse Autoencoders
by: Yeo, Wei Jie, et al.
Published: (2025)
Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization
by: Zhao, Jihao, et al.
Published: (2026)
SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering
by: Maskey, Utsav, et al.
Published: (2025)
PAL: Probing Audio Encoders via LLMs -- Audio Information Transfer into LLMs
by: Alex, Tony, et al.
Published: (2025)
Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models
by: Zhao, Xinyu, et al.
Published: (2025)
Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability
by: Chhabra, Vishnu Kabir, et al.
Published: (2025)
Characterizing Selective Refusal Bias in Large Language Models
by: Khorramrouz, Adel, et al.
Published: (2025)
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
by: Han, Seungju, et al.
Published: (2024)
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning
by: Zhang, Zhehao, et al.
Published: (2025)