| Main Authors: | Pluth, Dan; Houghton, Zachary Nicholas; Zhou, Yu; Gurbani, Vijay K. |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.12225 |
Similar Items
Sparse Autoencoder Insights on Voice Embeddings
by: Pluth, Daniel, et al.
Published: (2025)
How susceptible are LLMs to Logical Fallacies?
by: Payandeh, Amirreza, et al.
Published: (2023)
Digits micro-model for accurate and secure transactions
by: Chhablani, Chirag, et al.
Published: (2024)
DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
by: Wang, Xu, et al.
Published: (2026)
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
by: Farnik, Lucy, et al.
Published: (2025)
Beyond Transcription: Mechanistic Interpretability in ASR
by: Glazer, Neta, et al.
Published: (2025)
How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding
by: Chen, Xi, et al.
Published: (2025)
Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech
by: Du, Hongfei, et al.
Published: (2026)
Binary Autoencoder for Mechanistic Interpretability of Large Language Models
by: Cho, Hakaze, et al.
Published: (2025)
Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders
by: Zheng, Carolina, et al.
Published: (2025)
Interpretable Company Similarity with Sparse Autoencoders
by: Molinari, Marco, et al.
Published: (2024)
Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders
by: Kurochkin, Vadim, et al.
Published: (2025)
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
by: Wu, Xuansheng, et al.
Published: (2025)
Investigating grammatical abstraction in language models using few-shot learning of novel noun gender
by: Sukumaran, Priyanka, et al.
Published: (2024)
PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders
by: Frikha, Ahmed, et al.
Published: (2025)
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
by: Zhang, Ruikang, et al.
Published: (2026)
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
by: Karvonen, Adam, et al.
Published: (2025)
Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
by: Bhalla, Usha, et al.
Published: (2025)
CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders
by: Gulko, Alex, et al.
Published: (2025)
Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs
by: Harshavardhan
Published: (2026)
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
by: Lehn-Schiøler, William, et al.
Published: (2026)
A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
by: Shu, Dong, et al.
Published: (2025)
Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification
by: Wu, Xuansheng, et al.
Published: (2025)
Sparse Autoencoders for Hypothesis Generation
by: Movva, Rajiv, et al.
Published: (2025)
Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
by: Yu, Zeping, et al.
Published: (2024)
SteerRM: Debiasing Reward Models via Sparse Autoencoders
by: Sun, Mengyuan, et al.
Published: (2026)
Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders
by: Xuan, Richmond Sin Jing, et al.
Published: (2025)
Constrain Alignment with Sparse Autoencoders
by: Yin, Qingyu, et al.
Published: (2024)
Mechanistic Interpretability Needs Philosophy
by: Williams, Iwan, et al.
Published: (2025)
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
by: Muhamed, Aashiq, et al.
Published: (2024)
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
by: He, Zirui, et al.
Published: (2025)
Understanding Refusal in Language Models with Sparse Autoencoders
by: Yeo, Wei Jie, et al.
Published: (2025)
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
by: Galichin, Andrey, et al.
Published: (2025)
Modeling language contact with the Iterated Learning Model
by: Bullock, Seth, et al.
Published: (2024)
Task Arithmetic with Support Languages for Low-Resource ASR
by: Rafkin, Emma, et al.
Published: (2026)
Mechanistic Interpretability of Binary and Ternary Transformers
by: Li, Jason
Published: (2024)
ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders
by: Liu, Xiangyu, et al.
Published: (2025)
SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder
by: Liu, Dengcan, et al.
Published: (2025)
Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
by: Deng, Boyi, et al.
Published: (2025)
Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention
by: Rosser, J, et al.
Published: (2025)