Guardado en:
Detalles Bibliográficos
Autores principales: Kim, Su-Hyeon, Jin, Hyundong, Lee, Yejin, Han, Yo-Sub
Formato: Preprint
Publicado: 2026
Materias:
Acceso en línea:https://arxiv.org/abs/2604.01604
Etiquetas: Agregar Etiqueta
Sin Etiquetas, Sea el primero en etiquetar este registro!
_version_ 1866917537320534016
author Kim, Su-Hyeon
Jin, Hyundong
Lee, Yejin
Han, Yo-Sub
author_facet Kim, Su-Hyeon
Jin, Hyundong
Lee, Yejin
Han, Yo-Sub
contents While modern LLMs are aligned to refuse harmful requests, it is essential to understand the underlying mechanistic basis of this refusal behavior for model safety analysis. For example, steering-based jailbreak attacks exploit this by identifying and manipulating sparse, neuron-like refusal features to bypass safety guardrails. Current feature selection methods primarily rely on how strongly features activate on harmful prompts. However, activation strength alone often captures superficial heuristics such as topic or lexical cues, rather than the true causal mechanisms. Thus, selecting refusal features requires measuring inter-feature relationships, rather than treating each feature as an isolated activation signal. Based on this insight, we propose CRaFT, a circuit-guided framework for identifying critical refusal features that directly govern the refusal decision. CRaFT leverages cross-layer transcoders to map the model's internal computations into a sparse feature circuit graph, where edges quantify inter-feature influences and their contributions to the final output logits. By aggregating the effects propagating along the paths to refusal, CRaFT effectively ranks the most influential features. Extensive evaluations across four jailbreak benchmarks show that CRaFT significantly improves average performance from 6.7% to 57.4% and generates more specific harmful completions compared to current SOTA methods.
format Preprint
id arxiv_https___arxiv_org_abs_2604_01604
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders
Kim, Su-Hyeon
Jin, Hyundong
Lee, Yejin
Han, Yo-Sub
Artificial Intelligence
While modern LLMs are aligned to refuse harmful requests, it is essential to understand the underlying mechanistic basis of this refusal behavior for model safety analysis. For example, steering-based jailbreak attacks exploit this by identifying and manipulating sparse, neuron-like refusal features to bypass safety guardrails. Current feature selection methods primarily rely on how strongly features activate on harmful prompts. However, activation strength alone often captures superficial heuristics such as topic or lexical cues, rather than the true causal mechanisms. Thus, selecting refusal features requires measuring inter-feature relationships, rather than treating each feature as an isolated activation signal. Based on this insight, we propose CRaFT, a circuit-guided framework for identifying critical refusal features that directly govern the refusal decision. CRaFT leverages cross-layer transcoders to map the model's internal computations into a sparse feature circuit graph, where edges quantify inter-feature influences and their contributions to the final output logits. By aggregating the effects propagating along the paths to refusal, CRaFT effectively ranks the most influential features. Extensive evaluations across four jailbreak benchmarks show that CRaFT significantly improves average performance from 6.7% to 57.4% and generates more specific harmful completions compared to current SOTA methods.
title CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders
topic Artificial Intelligence
url https://arxiv.org/abs/2604.01604