:: Library Catalog

Bibliographic Details
Main Authors:	Fillingham, Sean P., Gordon, Andrew, Lai, Peter, Poncini, Xavier, Quarel, David, Heimersheim, Stefan
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2511.07572

Similar Items

Evolution of SAE Features Across Layers in LLMs
by: Balcells, Daniel, et al.
Published: (2024)

Evaluating Synthetic Activations composed of SAE Latents in GPT-2
by: Giglemiani, Giorgi, et al.
Published: (2024)

You can remove GPT2's LayerNorm by fine-tuning
by: Heimersheim, Stefan
Published: (2024)

How to use and interpret activation patching
by: Heimersheim, Stefan, et al.
Published: (2024)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
by: Lee, Daniel J., et al.
Published: (2024)

Interpreting Reinforcement Learning Agents with Susceptibilities
by: Elliott, Chris, et al.
Published: (2026)

Benchmarking Deception Probes via Black-to-White Performance Boosts
by: Parrack, Avi, et al.
Published: (2025)

Characterizing stable regions in the residual stream of LLMs
by: Janiak, Jett, et al.
Published: (2024)

Stagewise Reinforcement Learning and the Geometry of the Regret Landscape
by: Elliott, Chris, et al.
Published: (2026)

SCALAR: Self-Calibrating Adaptive Latent Attention Representation Learning
by: Abbas, Farwa, et al.
Published: (2025)

Detecting Strategic Deception Using Linear Probes
by: Goldowsky-Dill, Nicholas, et al.
Published: (2025)

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
by: Taufeeque, Mohammad, et al.
Published: (2026)

Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability
by: Baroni, Luca, et al.
Published: (2025)

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition
by: Braun, Dan, et al.
Published: (2025)

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
by: Bushnaq, Lucius, et al.
Published: (2024)

SCALAR: Learning and Composing Skills through LLM Guided Symbolic Planning and Deep RL Grounding
by: Zabounidis, Renos, et al.
Published: (2026)

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
by: Wiemann, Matt L., et al.
Published: (2026)

SCALAR: Quantifying Structural Hallucination, Consistency, and Reasoning Gaps in Materials Foundation Models
by: Polat, Can, et al.
Published: (2026)

Tokenized SAEs: Disentangling SAE Reconstructions
by: Dooms, Thomas, et al.
Published: (2025)

Evaluating SAE interpretability without explanations
by: Paulo, Gonçalo, et al.
Published: (2025)

PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding
by: Koromilas, Panagiotis, et al.
Published: (2026)

Dynamic Sparsity: Challenging Common Sparsity Assumptions for Learning World Models in Robotic Reinforcement Learning Benchmarks
by: Pandaram, Muthukumar, et al.
Published: (2025)

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
by: Yin, Lu, et al.
Published: (2023)

SAE: Single Architecture Ensemble Neural Networks
by: Ferianc, Martin, et al.
Published: (2024)

Investigating communication and professional communities at international events
by: Poncini, Gina
Published: (2013)

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
by: Bushnaq, Lucius, et al.
Published: (2024)

Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity
by: Guo, Wentao, et al.
Published: (2024)

Chatting Up Attachment: Using LLMs to Predict Adult Bonds
by: Soares, Paulo, et al.
Published: (2024)

AlignSAE: Concept-Aligned Sparse Autoencoders
by: Yang, Minglai, et al.
Published: (2025)

Beyond Toy Benchmarks: A Systematic Evaluation of OOD Detection Methods For Plant Pathology Classification
by: Shah, Devesh
Published: (2026)

Dense SAE Latents Are Features, Not Bugs
by: Sun, Xiaoqing, et al.
Published: (2025)

OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
by: Korznikov, Anton, et al.
Published: (2025)

Concept-SAE: Active Causal Probing of Visual Model Behavior
by: Ding, Jianrong, et al.
Published: (2025)

Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
by: Cao, Tue M., et al.
Published: (2026)

PATCH: Learnable Tile-level Hybrid Sparsity for LLMs
by: Hourri, Younes, et al.
Published: (2025)

Sparsity Forcing: Reinforcing Token Sparsity of MLLMs
by: Chen, Feng, et al.
Published: (2025)

Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect
by: Presa, Joao Paulo Cavalcante, et al.
Published: (2026)

Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity
by: Xu, Haotian, et al.
Published: (2025)

Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization
by: Li, Guanchen, et al.
Published: (2025)

SAU: Sparsity-Aware Unlearning for LLMs via Gradient Masking and Importance Redistribution
by: Wang, Yuze, et al.
Published: (2026)