:: Library Catalog

Bibliographic Details
Main Authors:	Gulko, Alex; Peng, Yusen; Kumar, Sachin
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2509.00691

Similar Items

DRIP: Dynamic patch Reduction via Interpretable Pooling
by: Peng, Yusen, et al.
Published: (2025)

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
by: Karvonen, Adam, et al.
Published: (2025)

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
by: Bhalla, Usha, et al.
Published: (2025)

Mechanistic Interpretability of ASR models using Sparse Autoencoders
by: Pluth, Dan, et al.
Published: (2026)

Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech
by: Du, Hongfei, et al.
Published: (2026)

Interpretable Company Similarity with Sparse Autoencoders
by: Molinari, Marco, et al.
Published: (2024)

GroundCocoa: A Benchmark for Evaluating Compositional & Conditional Reasoning in Language Models
by: Kohli, Harsh, et al.
Published: (2024)

LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
by: Fein, Daniel, et al.
Published: (2025)

Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders
by: Kurochkin, Vadim, et al.
Published: (2025)

KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge
by: Robertson, Alex, et al.
Published: (2026)

Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
by: Wu, Xuansheng, et al.
Published: (2025)

Sparse Autoencoders for Hypothesis Generation
by: Movva, Rajiv, et al.
Published: (2025)

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
by: O'Neill, Charles, et al.
Published: (2024)

Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
by: Perlitz, Yotam, et al.
Published: (2024)

YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
by: He, Muyu, et al.
Published: (2026)

Towards Understanding the Robustness of Sparse Autoencoders
by: Saiyed, Ahson, et al.
Published: (2026)

Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders
by: Xiong, Guangzhi, et al.
Published: (2025)

PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators
by: Hoang, Nguyen Khoi, et al.
Published: (2026)

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
by: Karvonen, Adam, et al.
Published: (2024)

SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation
by: Ma, Zhiming, et al.
Published: (2025)

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework
by: Weng, Jiaqi, et al.
Published: (2025)

PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders
by: Frikha, Ahmed, et al.
Published: (2025)

BenchBench: Benchmarking Automated Benchmark Generation
by: Zheng, Yandan, et al.
Published: (2026)

Training Superior Sparse Autoencoders for Instruct Models
by: Li, Jiaming, et al.
Published: (2025)

MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
by: Liu, Mianxin, et al.
Published: (2024)

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
by: Wu, Zhengxuan, et al.
Published: (2025)

A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
by: Shu, Dong, et al.
Published: (2025)

DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
by: Wang, Xu, et al.
Published: (2026)

Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
by: Li, Aaron J., et al.
Published: (2025)

Sparse Autoencoder Insights on Voice Embeddings
by: Pluth, Daniel, et al.
Published: (2025)

Steering off Course: Reliability Challenges in Steering Language Models
by: Da Silva, Patrick Queiroz, et al.
Published: (2025)

Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
by: Maekawa, Seiji, et al.
Published: (2025)

How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding
by: Chen, Xi, et al.
Published: (2025)

Constrain Alignment with Sparse Autoencoders
by: Yin, Qingyu, et al.
Published: (2024)

StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
by: Wu, Cheng-Kuang, et al.
Published: (2024)

PhageBench: Can LLMs Understand Raw Bacteriophage Genomes?
by: Hou, Yusen, et al.
Published: (2026)

SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
by: He, Zirui, et al.
Published: (2025)

Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
by: Muhamed, Aashiq, et al.
Published: (2024)

HealthBench: Evaluating Large Language Models Towards Improved Human Health
by: Arora, Rahul K., et al.
Published: (2025)

Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
by: Minegishi, Gouki, et al.
Published: (2025)