| Main Authors: | Fillingham, Sean P., Gordon, Andrew, Lai, Peter, Poncini, Xavier, Quarel, David, Heimersheim, Stefan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.07572 |
Similar Items
Evolution of SAE Features Across Layers in LLMs
by: Balcells, Daniel, et al.
Published: (2024)
Evaluating Synthetic Activations composed of SAE Latents in GPT-2
by: Giglemiani, Giorgi, et al.
Published: (2024)
You can remove GPT2's LayerNorm by fine-tuning
by: Heimersheim, Stefan
Published: (2024)
How to use and interpret activation patching
by: Heimersheim, Stefan, et al.
Published: (2024)
Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
by: Lee, Daniel J., et al.
Published: (2024)
Interpreting Reinforcement Learning Agents with Susceptibilities
by: Elliott, Chris, et al.
Published: (2026)
Benchmarking Deception Probes via Black-to-White Performance Boosts
by: Parrack, Avi, et al.
Published: (2025)
Characterizing stable regions in the residual stream of LLMs
by: Janiak, Jett, et al.
Published: (2024)
Stagewise Reinforcement Learning and the Geometry of the Regret Landscape
by: Elliott, Chris, et al.
Published: (2026)
SCALAR: Self-Calibrating Adaptive Latent Attention Representation Learning
by: Abbas, Farwa, et al.
Published: (2025)
Detecting Strategic Deception Using Linear Probes
by: Goldowsky-Dill, Nicholas, et al.
Published: (2025)
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
by: Taufeeque, Mohammad, et al.
Published: (2026)
Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability
by: Baroni, Luca, et al.
Published: (2025)
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition
by: Braun, Dan, et al.
Published: (2025)
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
by: Bushnaq, Lucius, et al.
Published: (2024)
SCALAR: Learning and Composing Skills through LLM Guided Symbolic Planning and Deep RL Grounding
by: Zabounidis, Renos, et al.
Published: (2026)
DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
by: Wiemann, Matt L., et al.
Published: (2026)
SCALAR: Quantifying Structural Hallucination, Consistency, and Reasoning Gaps in Materials Foundation Models
by: Polat, Can, et al.
Published: (2026)
Tokenized SAEs: Disentangling SAE Reconstructions
by: Dooms, Thomas, et al.
Published: (2025)
Evaluating SAE interpretability without explanations
by: Paulo, Gonçalo, et al.
Published: (2025)
PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding
by: Koromilas, Panagiotis, et al.
Published: (2026)
Dynamic Sparsity: Challenging Common Sparsity Assumptions for Learning World Models in Robotic Reinforcement Learning Benchmarks
by: Pandaram, Muthukumar, et al.
Published: (2025)
Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
by: Yin, Lu, et al.
Published: (2023)
SAE: Single Architecture Ensemble Neural Networks
by: Ferianc, Martin, et al.
Published: (2024)
Investigating communication and professional communities at international events
by: Gina Poncini
Published: (2013)
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
by: Bushnaq, Lucius, et al.
Published: (2024)
Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity
by: Guo, Wentao, et al.
Published: (2024)
Chatting Up Attachment: Using LLMs to Predict Adult Bonds
by: Soares, Paulo, et al.
Published: (2024)
AlignSAE: Concept-Aligned Sparse Autoencoders
by: Yang, Minglai, et al.
Published: (2025)
Beyond Toy Benchmarks: A Systematic Evaluation of OOD Detection Methods For Plant Pathology Classification
by: Shah, Devesh
Published: (2026)
Dense SAE Latents Are Features, Not Bugs
by: Sun, Xiaoqing, et al.
Published: (2025)
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
by: Korznikov, Anton, et al.
Published: (2025)
Concept-SAE: Active Causal Probing of Visual Model Behavior
by: Ding, Jianrong, et al.
Published: (2025)
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
by: Cao, Tue M., et al.
Published: (2026)
PATCH: Learnable Tile-level Hybrid Sparsity for LLMs
by: Hourri, Younes, et al.
Published: (2025)
Sparsity Forcing: Reinforcing Token Sparsity of MLLMs
by: Chen, Feng, et al.
Published: (2025)
Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect
by: Presa, Joao Paulo Cavalcante, et al.
Published: (2026)
Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity
by: Xu, Haotian, et al.
Published: (2025)
Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization
by: Li, Guanchen, et al.
Published: (2025)
SAU: Sparsity-Aware Unlearning for LLMs via Gradient Masking and Importance Redistribution
by: Wang, Yuze, et al.
Published: (2026)