| Main Authors: | Gulko, Alex, Peng, Yusen, Kumar, Sachin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2509.00691 |
Similar Items
DRIP: Dynamic patch Reduction via Interpretable Pooling
by: Peng, Yusen, et al.
Published: (2025)
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
by: Karvonen, Adam, et al.
Published: (2025)
Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
by: Bhalla, Usha, et al.
Published: (2025)
Mechanistic Interpretability of ASR models using Sparse Autoencoders
by: Pluth, Dan, et al.
Published: (2026)
Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech
by: Du, Hongfei, et al.
Published: (2026)
Interpretable Company Similarity with Sparse Autoencoders
by: Molinari, Marco, et al.
Published: (2024)
GroundCocoa: A Benchmark for Evaluating Compositional & Conditional Reasoning in Language Models
by: Kohli, Harsh, et al.
Published: (2024)
LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
by: Fein, Daniel, et al.
Published: (2025)
Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders
by: Kurochkin, Vadim, et al.
Published: (2025)
KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge
by: Robertson, Alex, et al.
Published: (2026)
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
by: Wu, Xuansheng, et al.
Published: (2025)
Sparse Autoencoders for Hypothesis Generation
by: Movva, Rajiv, et al.
Published: (2025)
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
by: O'Neill, Charles, et al.
Published: (2024)
Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
by: Perlitz, Yotam, et al.
Published: (2024)
$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
by: He, Muyu, et al.
Published: (2026)
Towards Understanding the Robustness of Sparse Autoencoders
by: Saiyed, Ahson, et al.
Published: (2026)
Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders
by: Xiong, Guangzhi, et al.
Published: (2025)
PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators
by: Hoang, Nguyen Khoi, et al.
Published: (2026)
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
by: Karvonen, Adam, et al.
Published: (2024)
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation
by: Ma, Zhiming, et al.
Published: (2025)
Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework
by: Weng, Jiaqi, et al.
Published: (2025)
PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders
by: Frikha, Ahmed, et al.
Published: (2025)
BenchBench: Benchmarking Automated Benchmark Generation
by: Zheng, Yandan, et al.
Published: (2026)
Training Superior Sparse Autoencoders for Instruct Models
by: Li, Jiaming, et al.
Published: (2025)
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
by: Liu, Mianxin, et al.
Published: (2024)
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
by: Wu, Zhengxuan, et al.
Published: (2025)
A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
by: Shu, Dong, et al.
Published: (2025)
DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
by: Wang, Xu, et al.
Published: (2026)
Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
by: Li, Aaron J., et al.
Published: (2025)
Sparse Autoencoder Insights on Voice Embeddings
by: Pluth, Daniel, et al.
Published: (2025)
Steering off Course: Reliability Challenges in Steering Language Models
by: Da Silva, Patrick Queiroz, et al.
Published: (2025)
Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
by: Maekawa, Seiji, et al.
Published: (2025)
How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding
by: Chen, Xi, et al.
Published: (2025)
Constrain Alignment with Sparse Autoencoders
by: Yin, Qingyu, et al.
Published: (2024)
StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
by: Wu, Cheng-Kuang, et al.
Published: (2024)
PhageBench: Can LLMs Understand Raw Bacteriophage Genomes?
by: Hou, Yusen, et al.
Published: (2026)
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
by: He, Zirui, et al.
Published: (2025)
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
by: Muhamed, Aashiq, et al.
Published: (2024)
HealthBench: Evaluating Large Language Models Towards Improved Human Health
by: Arora, Rahul K., et al.
Published: (2025)
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
by: Minegishi, Gouki, et al.
Published: (2025)