:: Library Catalog

Bibliographic details
Main authors:	Bussmann, Bart; Leask, Patrick; Nanda, Neel
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Artificial Intelligence
Online access:	https://arxiv.org/abs/2412.06410

Similar items

Sparse Autoencoders Do Not Find Canonical Units of Analysis
by: Leask, Patrick, et al.
Published: (2025)

Learning Multi-Level Features with Matryoshka Sparse Autoencoders
by: Bussmann, Bart, et al.
Published: (2025)

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
by: Kantamneni, Subhash, et al.
Published: (2025)

Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs
by: Bozoukov, Matthew, et al.
Published: (2025)

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit
by: Jiang, Nick, et al.
Published: (2025)

Improving Dictionary Learning with Gated Sparse Autoencoders
by: Rajamanoharan, Senthooran, et al.
Published: (2024)

AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features
by: Zhu, Xudong, et al.
Published: (2025)

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
by: Leask, Patrick, et al.
Published: (2025)

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions
by: Poduval, Prathyush, et al.
Published: (2026)

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
by: Lieberum, Tom, et al.
Published: (2024)

Explorations of Self-Repair in Language Models
by: Rushing, Cody, et al.
Published: (2024)

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
by: Zhang, Fred, et al.
Published: (2023)

Bidirectional Variational Autoencoders
by: Kosko, Bart, et al.
Published: (2025)

Convergent Linear Representations of Emergent Misalignment
by: Soligo, Anna, et al.
Published: (2025)

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
by: Makelov, Aleksandar, et al.
Published: (2024)

Sparse Autoencoders, Again?
by: Lu, Yin, et al.
Published: (2025)

Model Organisms for Emergent Misalignment
by: Turner, Edward, et al.
Published: (2025)

Understanding Reasoning in Thinking Language Models via Steering Vectors
by: Venhoff, Constantin, et al.
Published: (2025)

Base Models Know How to Reason, Thinking Models Learn When
by: Venhoff, Constantin, et al.
Published: (2025)

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
by: Ferrando, Javier, et al.
Published: (2024)

Are Sparse Autoencoder Benchmarks Reliable?
by: Chanin, David
Published: (2026)

Thought Branches: Interpreting LLM Reasoning Requires Resampling
by: Macar, Uzay, et al.
Published: (2025)

Thought Anchors: Which LLM Reasoning Steps Matter?
by: Bogdan, Paul C., et al.
Published: (2025)

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
by: Karvonen, Adam, et al.
Published: (2024)

Improving Sparse Autoencoder with Dynamic Attention
by: Wang, Dongsheng, et al.
Published: (2026)

What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering
by: Maar, Jim, et al.
Published: (2026)

Scaling sparse feature circuit finding for in-context learning
by: Kharlapenko, Dmitrii, et al.
Published: (2025)

Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
by: Minder, Julian, et al.
Published: (2025)

Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers
by: Ji, Xiaotong, et al.
Published: (2026)

Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders
by: Ayonrinde, Kola
Published: (2024)

Empirical Evaluation of Progressive Coding for Sparse Autoencoders
by: Peter, Hans, et al.
Published: (2025)

Do Sparse Autoencoders Capture Concept Manifolds?
by: Bhalla, Usha, et al.
Published: (2026)

On the transferability of Sparse Autoencoders for interpreting compressed models
by: Gupte, Suchit, et al.
Published: (2025)

Data Whitening Improves Sparse Autoencoder Learning
by: Saraswatula, Ashwin, et al.
Published: (2025)

Interpreting Attention Layer Outputs with Sparse Autoencoders
by: Kissane, Connor, et al.
Published: (2024)

Improving Robustness In Sparse Autoencoders via Masked Regularization
by: Narayanaswamy, Vivek, et al.
Published: (2026)

Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
by: Lee, Sewoong, et al.
Published: (2025)

Graph-Regularized Sparse Autoencoders for LLM Safety Steering
by: Yeon, Jehyeok, et al.
Published: (2025)

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
by: Casademunt, Helena, et al.
Published: (2025)

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
by: Arcuschin, Iván, et al.
Published: (2025)