| Main authors: | Bussmann, Bart; Leask, Patrick; Nanda, Neel |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2412.06410 |
Similar items
Sparse Autoencoders Do Not Find Canonical Units of Analysis
by: Leask, Patrick, et al.
Published: (2025)
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
by: Bussmann, Bart, et al.
Published: (2025)
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
by: Kantamneni, Subhash, et al.
Published: (2025)
Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs
by: Bozoukov, Matthew, et al.
Published: (2025)
Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit
by: Jiang, Nick, et al.
Published: (2025)
Improving Dictionary Learning with Gated Sparse Autoencoders
by: Rajamanoharan, Senthooran, et al.
Published: (2024)
AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features
by: Zhu, Xudong, et al.
Published: (2025)
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
by: Leask, Patrick, et al.
Published: (2025)
ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions
by: Poduval, Prathyush, et al.
Published: (2026)
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
by: Lieberum, Tom, et al.
Published: (2024)
Explorations of Self-Repair in Language Models
by: Rushing, Cody, et al.
Published: (2024)
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
by: Zhang, Fred, et al.
Published: (2023)
Bidirectional Variational Autoencoders
by: Kosko, Bart, et al.
Published: (2025)
Convergent Linear Representations of Emergent Misalignment
by: Soligo, Anna, et al.
Published: (2025)
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
by: Makelov, Aleksandar, et al.
Published: (2024)
Sparse Autoencoders, Again?
by: Lu, Yin, et al.
Published: (2025)
Model Organisms for Emergent Misalignment
by: Turner, Edward, et al.
Published: (2025)
Understanding Reasoning in Thinking Language Models via Steering Vectors
by: Venhoff, Constantin, et al.
Published: (2025)
Base Models Know How to Reason, Thinking Models Learn When
by: Venhoff, Constantin, et al.
Published: (2025)
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
by: Ferrando, Javier, et al.
Published: (2024)
Are Sparse Autoencoder Benchmarks Reliable?
by: Chanin, David
Published: (2026)
Thought Branches: Interpreting LLM Reasoning Requires Resampling
by: Macar, Uzay, et al.
Published: (2025)
Thought Anchors: Which LLM Reasoning Steps Matter?
by: Bogdan, Paul C., et al.
Published: (2025)
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
by: Karvonen, Adam, et al.
Published: (2024)
Improving Sparse Autoencoder with Dynamic Attention
by: Wang, Dongsheng, et al.
Published: (2026)
What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering
by: Maar, Jim, et al.
Published: (2026)
Scaling sparse feature circuit finding for in-context learning
by: Kharlapenko, Dmitrii, et al.
Published: (2025)
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
by: Minder, Julian, et al.
Published: (2025)
Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers
by: Ji, Xiaotong, et al.
Published: (2026)
Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders
by: Ayonrinde, Kola
Published: (2024)
Empirical Evaluation of Progressive Coding for Sparse Autoencoders
by: Peter, Hans, et al.
Published: (2025)
Do Sparse Autoencoders Capture Concept Manifolds?
by: Bhalla, Usha, et al.
Published: (2026)
On the transferability of Sparse Autoencoders for interpreting compressed models
by: Gupte, Suchit, et al.
Published: (2025)
Data Whitening Improves Sparse Autoencoder Learning
by: Saraswatula, Ashwin, et al.
Published: (2025)
Interpreting Attention Layer Outputs with Sparse Autoencoders
by: Kissane, Connor, et al.
Published: (2024)
Improving Robustness In Sparse Autoencoders via Masked Regularization
by: Narayanaswamy, Vivek, et al.
Published: (2026)
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
by: Lee, Sewoong, et al.
Published: (2025)
Graph-Regularized Sparse Autoencoders for LLM Safety Steering
by: Yeon, Jehyeok, et al.
Published: (2025)
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
by: Casademunt, Helena, et al.
Published: (2025)
Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
by: Arcuschin, Iván, et al.
Published: (2025)