Bibliographic Details
Main Authors: Kasliwal, Aditya, Seth, Pratinav, Sankarapu, Vinay Kumar
Format: Preprint
Published: 2026
Subjects: Computation and Language; Emerging Technologies
Online Access: https://arxiv.org/abs/2602.04521
author Kasliwal, Aditya
Seth, Pratinav
Sankarapu, Vinay Kumar
contents Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ: Circuit-Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ_C supported only on that circuit (typically <5% of parameters). Applying Δθ_C yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
format Preprint
id arxiv_https___arxiv_org_abs_2602_04521
institution arXiv
publishDate 2026
record_format arxiv
title $C$-$ΔΘ$: Circuit-Restricted Weight Arithmetic for Selective Refusal
topic Computation and Language
Emerging Technologies
url https://arxiv.org/abs/2602.04521
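
Step (ii) of the abstract describes a weight update that is nonzero only on a localized circuit and is then folded into an ordinary checkpoint. The sketch below illustrates just that "restricted support" idea with NumPy. It is not the authors' implementation: the abstract does not say how the update is computed, so `delta` here is a stand-in array of ones, the circuit `mask` is written by hand rather than produced by EAP-IG, and all names (`circuit_restricted_update`, `support_fraction`, the tensor keys, `scale`) are hypothetical.

```python
import numpy as np


def circuit_restricted_update(base, delta, mask, scale=1.0):
    """Return base + scale * (mask * delta) for every named tensor.

    Tensors with no entry in `mask` are treated as off-circuit and copied
    unchanged. `mask` holds 0/1 arrays with the same shapes as the weights.
    """
    edited = {}
    for name, weights in base.items():
        m = mask.get(name)
        if m is None:
            edited[name] = weights.copy()          # off-circuit: untouched
        else:
            edited[name] = weights + scale * m * delta[name]
    return edited


def support_fraction(mask, base):
    """Fraction of all parameters on which the update may be nonzero."""
    total = sum(w.size for w in base.values())
    active = sum(int(np.count_nonzero(m)) for m in mask.values())
    return active / total


# Toy "model" with two weight matrices (hypothetical names and shapes).
base = {
    "attn.W_o": np.zeros((4, 4)),
    "mlp.W_in": np.zeros((4, 8)),
}

# Stand-in for an unrestricted update direction (assumption, not from the paper).
delta = {name: np.ones_like(w) for name, w in base.items()}

# Hand-written circuit mask: only two entries of one matrix are "on-circuit".
circuit = np.zeros((4, 4))
circuit[0, 0] = 1.0
circuit[0, 1] = 1.0
mask = {"attn.W_o": circuit}

edited = circuit_restricted_update(base, delta, mask)
fraction = support_fraction(mask, base)
print(f"update touches {fraction:.1%} of parameters")
```

Because `edited` is just another dictionary of weight tensors, it can be saved and served like any other checkpoint, which is the property the abstract refers to as a drop-in edit with no inference-time hooks.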