

Bibliographic Details
Main Authors:	Kasliwal, Aditya; Seth, Pratinav; Sankarapu, Vinay Kumar
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Emerging Technologies
Online Access:	https://arxiv.org/abs/2602.04521

Abstract:

Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ: Circuit Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ_C supported only on that circuit (typically <5% of parameters). Applying Δθ_C yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
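
The idea of a weight update "supported only on that circuit" can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the update is computed, so the sketch assumes it is the difference between two hypothetical checkpoints (`base` and `tuned`), and it assumes the circuit is already available as a boolean mask rather than derived with EAP-IG. The names `circuit_restricted_update`, `edited_fraction`, and `scale` are illustrative only.

```python
from typing import Dict, List

# Hypothetical toy representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]
Mask = Dict[str, List[bool]]


def circuit_restricted_update(
    base: Weights, tuned: Weights, mask: Mask, scale: float = 1.0
) -> Weights:
    """Return base + scale * (tuned - base), applied only where mask is True.

    Assumption: the update is a checkpoint difference. Entries outside the
    mask are copied from `base` unchanged, so the edit is supported only on
    the masked (circuit) positions.
    """
    edited: Weights = {}
    for name, base_vals in base.items():
        tuned_vals = tuned[name]
        # Parameters missing from the mask are treated as outside the circuit.
        keep = mask.get(name, [False] * len(base_vals))
        edited[name] = [
            b + scale * (t - b) if m else b
            for b, t, m in zip(base_vals, tuned_vals, keep)
        ]
    return edited


def edited_fraction(base: Weights, mask: Mask) -> float:
    """Fraction of all parameters that fall inside the mask."""
    total = sum(len(vals) for vals in base.values())
    selected = sum(sum(flags) for flags in mask.values())
    return selected / total


# Toy data: only the first of four entries belongs to the "circuit".
base = {"w": [1.0, 2.0, 3.0, 4.0]}
tuned = {"w": [2.0, 2.0, 5.0, 0.0]}
mask = {"w": [True, False, False, False]}

edited = circuit_restricted_update(base, tuned, mask)
fraction = edited_fraction(base, mask)
```

Because the result is an ordinary set of weights, it can be saved and served like any other checkpoint, which is the sense in which the edit needs no inference-time hooks.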

Similar Items