| Main Authors: | Luo, Yifan, Zhan, Yang, Jiang, Jiedong, Liu, Tianyang, Wu, Mingrui, Zhou, Zhennan, Dong, Bin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.11881 |
Similar Items
InverseScope: Scalable Activation Inversion for Interpreting Large Language Models
by: Luo, Yifan, et al.
Published: (2025)
Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting
by: Luo, Yifan, et al.
Published: (2024)
The Geometry of Concepts: Sparse Autoencoder Feature Structure
by: Li, Yuxiao, et al.
Published: (2024)
Measuring Sparse Autoencoder Feature Sensitivity
by: Tian, Claire, et al.
Published: (2025)
Graph-Regularized Sparse Autoencoders for LLM Safety Steering
by: Yeon, Jehyeok, et al.
Published: (2025)
Interpreting CLIP with Hierarchical Sparse Autoencoders
by: Zaigrajew, Vladimir, et al.
Published: (2025)
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
by: Cao, Tue M., et al.
Published: (2026)
Sparse Autoencoder Features for Classifications and Transferability
by: Gallifant, Jack, et al.
Published: (2025)
Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures
by: Muchane, Mark, et al.
Published: (2025)
Towards Interpretable Protein Structure Prediction with Sparse Autoencoders
by: Parsan, Nithin, et al.
Published: (2025)
From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features
by: Fernandez-Boullon, Ruben, et al.
Published: (2026)
Feature Starvation as Geometric Instability in Sparse Autoencoders
by: Chaudhry, Faris, et al.
Published: (2026)
Causal Interpretation of Sparse Autoencoder Features in Vision
by: Han, Sangyu, et al.
Published: (2025)
Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders
by: Ayonrinde, Kola
Published: (2024)
Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)
Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement
by: Wang, Anyi, et al.
Published: (2025)
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
by: Bussmann, Bart, et al.
Published: (2025)
Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features
by: Winnicki, John, et al.
Published: (2026)
LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving
by: Gao, Guoxiong, et al.
Published: (2026)
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
by: Shu, Dong, et al.
Published: (2025)
Improving Steering Vectors by Targeting Sparse Autoencoder Features
by: Chalnev, Sviatoslav, et al.
Published: (2024)
Autoencoding Random Forests
by: Vu, Binh Duc, et al.
Published: (2025)
Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
by: Chanin, David, et al.
Published: (2024)
Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders
by: Chen, Siyu, et al.
Published: (2025)
Sparse Autoencoder Decomposition of Clinical Sequence Model Representations: Feature Complexity, Task Specialisation, and Mortality Prediction
by: Sainsbury, Chris, et al.
Published: (2026)
AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features
by: Zhu, Xudong, et al.
Published: (2025)
Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression
by: Cao, Tue M., et al.
Published: (2026)
Beyond Self-Play: Hierarchical Reasoning for Continuous Motion in Closed-Loop Traffic Simulation
by: Zhang, Weifan, et al.
Published: (2026)
Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering
by: Zhao, Haiyan, et al.
Published: (2025)
Herald: A Natural Language Annotated Lean 4 Dataset
by: Gao, Guoxiong, et al.
Published: (2024)
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
by: Pach, Mateusz, et al.
Published: (2025)
Constrain Alignment with Sparse Autoencoders
by: Yin, Qingyu, et al.
Published: (2024)
A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
by: Shu, Dong, et al.
Published: (2025)
Sparse Autoencoders, Again?
by: Lu, Yin, et al.
Published: (2025)
CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features
by: Cho, Seonglae, et al.
Published: (2025)
SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder
by: Liu, Dengcan, et al.
Published: (2025)
Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features
by: Cho, Seonglae, et al.
Published: (2026)
Are Sparse Autoencoder Benchmarks Reliable?
by: Chanin, David
Published: (2026)
Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone
by: Bărbălau, Antonio, et al.
Published: (2025)