| Main Authors: | Ronge, Raphael; Maier, Markus; Eberhardt, Frederick |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.03047 |
Similar Items
Interpretable Steering of Large Language Models with Feature Guided Activation Additions
by: Soo, Samuel, et al.
Published: (2025)
Lost in Aggregation: The Causal Interpretation of the IV Estimand
by: Tsao, Danielle, et al.
Published: (2026)
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
by: Laptev, Daniil, et al.
Published: (2025)
Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
by: Li, Zihao, et al.
Published: (2025)
Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
by: Song, Xiangchen, et al.
Published: (2025)
Controlling for discrete unmeasured confounding in nonlinear causal models
by: Burauel, Patrick, et al.
Published: (2024)
When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
by: Gonzalez, ML Nissen, et al.
Published: (2026)
Interpretable Prediction and Feature Selection for Survival Analysis
by: Van Ness, Mike, et al.
Published: (2024)
DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI
by: Cho, Hyowon, et al.
Published: (2024)
Comparing Feature Importance and Rule Extraction for Interpretability on Text Data
by: Lopardo, Gianluigi, et al.
Published: (2022)
Mechanistic Permutability: Match Features Across Layers
by: Balagansky, Nikita, et al.
Published: (2024)
Investigating Graph Neural Networks and Classical Feature-Extraction Techniques in Activity-Cliff and Molecular Property Prediction
by: Dablander, Markus
Published: (2024)
Lower Bounds on the Size of Markov Equivalence Classes
by: Jahn, Erik, et al.
Published: (2025)
Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks
by: Mohan, Vamshi Sunku, et al.
Published: (2026)
PHLP: Sole Persistent Homology for Link Prediction - Interpretable Feature Extraction
by: You, Junwon, et al.
Published: (2024)
To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models
by: Hedström, Anna, et al.
Published: (2025)
IDP-PGFE: An Interpretable Disruption Predictor based on Physics-Guided Feature Extraction
by: Shen, Chengshuo, et al.
Published: (2022)
When a Zero-Shooter Cheats: Improving Age Estimation via Activation Steering
by: Imgrund, Erik, et al.
Published: (2026)
Focus On This, Not That! Steering LLMs with Adaptive Feature Specification
by: Lamb, Tom A., et al.
Published: (2024)
Mind the Performance Gap: Capability-Behavior Trade-offs in Feature Steering
by: Sprejer, Eitan, et al.
Published: (2026)
Steered Generation via Gradient-Based Optimization on Sparse Query Features
by: Bhattacharyya, Sumanta, et al.
Published: (2026)
Stable and Interpretable Jet Physics with IRC-Safe Equivariant Feature Extraction
by: Konar, Partha, et al.
Published: (2025)
Interpretable Features for the Assessment of Neurodegenerative Diseases through Handwriting Analysis
by: Thebaud, Thomas, et al.
Published: (2024)
Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs
by: Harshavardhan
Published: (2026)
Improving Steering Vectors by Targeting Sparse Autoencoder Features
by: Chalnev, Sviatoslav, et al.
Published: (2024)
SAEs Are Good for Steering -- If You Select the Right Features
by: Arad, Dana, et al.
Published: (2025)
Student sentiment Analysis Using Classification With Feature Extraction Techniques
by: Tamrakar, Latika, et al.
Published: (2021)
Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
by: Xu, Yang, et al.
Published: (2024)
Local Feature Selection without Label or Feature Leakage for Interpretable Machine Learning Predictions
by: Oosterhuis, Harrie, et al.
Published: (2024)
SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
by: Ghosh, Shaona, et al.
Published: (2025)
Explaining Concept Shift with Interpretable Feature Attribution
by: Lyu, Ruiqi, et al.
Published: (2025)
Semantic-Guided RL for Interpretable Feature Engineering
by: Bouadi, Mohamed, et al.
Published: (2024)
Gradient Boosting Mapping for Dimensionality Reduction and Feature Extraction
by: Patron, Anri, et al.
Published: (2024)
Exemplar Partitioning for Mechanistic Interpretability
by: Rumbelow, Jessica
Published: (2026)
From Mechanistic to Compositional Interpretability
by: Gauderis, Ward, et al.
Published: (2026)
Open Problems in Mechanistic Interpretability
by: Sharkey, Lee, et al.
Published: (2025)
Feature-Based Interpretable Surrogates for Optimization
by: Goerigk, Marc, et al.
Published: (2024)
When Can You Trust Your Explanations? A Robustness Analysis on Feature Importances
by: Vascotto, Ilaria, et al.
Published: (2024)
Interpreting Emergent Features in Deep Learning-based Side-channel Analysis
by: Karayalçin, Sengim, et al.
Published: (2025)
Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features
by: Cho, Seonglae, et al.
Published: (2026)