:: Library Catalog

Bibliographic Details
Main Authors:	Pluth, Dan; Houghton, Zachary Nicholas; Zhou, Yu; Gurbani, Vijay K.
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2605.12225

Similar Items

Sparse Autoencoder Insights on Voice Embeddings
by: Pluth, Daniel, et al.
Published: (2025)

How susceptible are LLMs to Logical Fallacies?
by: Payandeh, Amirreza, et al.
Published: (2023)

Digits micro-model for accurate and secure transactions
by: Chhablani, Chirag, et al.
Published: (2024)

DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
by: Wang, Xu, et al.
Published: (2026)

Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
by: Farnik, Lucy, et al.
Published: (2025)

Beyond Transcription: Mechanistic Interpretability in ASR
by: Glazer, Neta, et al.
Published: (2025)

How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding
by: Chen, Xi, et al.
Published: (2025)

Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech
by: Du, Hongfei, et al.
Published: (2026)

Binary Autoencoder for Mechanistic Interpretability of Large Language Models
by: Cho, Hakaze, et al.
Published: (2025)

Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders
by: Zheng, Carolina, et al.
Published: (2025)

Interpretable Company Similarity with Sparse Autoencoders
by: Molinari, Marco, et al.
Published: (2024)

Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders
by: Kurochkin, Vadim, et al.
Published: (2025)

Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
by: Wu, Xuansheng, et al.
Published: (2025)

Investigating grammatical abstraction in language models using few-shot learning of novel noun gender
by: Sukumaran, Priyanka, et al.
Published: (2024)

PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders
by: Frikha, Ahmed, et al.
Published: (2025)

Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
by: Zhang, Ruikang, et al.
Published: (2026)

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
by: Karvonen, Adam, et al.
Published: (2025)

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
by: Bhalla, Usha, et al.
Published: (2025)

CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders
by: Gulko, Alex, et al.
Published: (2025)

Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs
by: Harshavardhan
Published: (2026)

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
by: Lehn-Schiøler, William, et al.
Published: (2026)

A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
by: Shu, Dong, et al.
Published: (2025)

Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification
by: Wu, Xuansheng, et al.
Published: (2025)

Sparse Autoencoders for Hypothesis Generation
by: Movva, Rajiv, et al.
Published: (2025)

Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
by: Yu, Zeping, et al.
Published: (2024)

SteerRM: Debiasing Reward Models via Sparse Autoencoders
by: Sun, Mengyuan, et al.
Published: (2026)

Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders
by: Xuan, Richmond Sin Jing, et al.
Published: (2025)

Constrain Alignment with Sparse Autoencoders
by: Yin, Qingyu, et al.
Published: (2024)

Mechanistic Interpretability Needs Philosophy
by: Williams, Iwan, et al.
Published: (2025)

Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
by: Muhamed, Aashiq, et al.
Published: (2024)

SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
by: He, Zirui, et al.
Published: (2025)

Understanding Refusal in Language Models with Sparse Autoencoders
by: Yeo, Wei Jie, et al.
Published: (2025)

I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
by: Galichin, Andrey, et al.
Published: (2025)

Modeling language contact with the Iterated Learning Model
by: Bullock, Seth, et al.
Published: (2024)

Task Arithmetic with Support Languages for Low-Resource ASR
by: Rafkin, Emma, et al.
Published: (2026)

Mechanistic Interpretability of Binary and Ternary Transformers
by: Li, Jason
Published: (2024)

ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders
by: Liu, Xiangyu, et al.
Published: (2025)

SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder
by: Liu, Dengcan, et al.
Published: (2025)

Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
by: Deng, Boyi, et al.
Published: (2025)

Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention
by: Rosser, J, et al.
Published: (2025)