:: Library Catalog

Bibliographic Details
Main Authors:	Ronge, Raphael; Maier, Markus; Eberhardt, Frederick
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2601.03047

Similar Items

Interpretable Steering of Large Language Models with Feature Guided Activation Additions
by: Soo, Samuel, et al.
Published: (2025)

Lost in Aggregation: The Causal Interpretation of the IV Estimand
by: Tsao, Danielle, et al.
Published: (2026)

Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
by: Laptev, Daniil, et al.
Published: (2025)

Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
by: Li, Zihao, et al.
Published: (2025)

Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
by: Song, Xiangchen, et al.
Published: (2025)

Controlling for discrete unmeasured confounding in nonlinear causal models
by: Burauel, Patrick, et al.
Published: (2024)

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
by: Gonzalez, ML Nissen, et al.
Published: (2026)

Interpretable Prediction and Feature Selection for Survival Analysis
by: Van Ness, Mike, et al.
Published: (2024)

DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI
by: Cho, Hyowon, et al.
Published: (2024)

Comparing Feature Importance and Rule Extraction for Interpretability on Text Data
by: Lopardo, Gianluigi, et al.
Published: (2022)

Mechanistic Permutability: Match Features Across Layers
by: Balagansky, Nikita, et al.
Published: (2024)

Investigating Graph Neural Networks and Classical Feature-Extraction Techniques in Activity-Cliff and Molecular Property Prediction
by: Dablander, Markus
Published: (2024)

Lower Bounds on the Size of Markov Equivalence Classes
by: Jahn, Erik, et al.
Published: (2025)

Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks
by: Mohan, Vamshi Sunku, et al.
Published: (2026)

PHLP: Sole Persistent Homology for Link Prediction - Interpretable Feature Extraction
by: You, Junwon, et al.
Published: (2024)

To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models
by: Hedström, Anna, et al.
Published: (2025)

IDP-PGFE: An Interpretable Disruption Predictor based on Physics-Guided Feature Extraction
by: Shen, Chengshuo, et al.
Published: (2022)

When a Zero-Shooter Cheats: Improving Age Estimation via Activation Steering
by: Imgrund, Erik, et al.
Published: (2026)

Focus On This, Not That! Steering LLMs with Adaptive Feature Specification
by: Lamb, Tom A., et al.
Published: (2024)

Mind the Performance Gap: Capability-Behavior Trade-offs in Feature Steering
by: Sprejer, Eitan, et al.
Published: (2026)

Steered Generation via Gradient-Based Optimization on Sparse Query Features
by: Bhattacharyya, Sumanta, et al.
Published: (2026)

Stable and Interpretable Jet Physics with IRC-Safe Equivariant Feature Extraction
by: Konar, Partha, et al.
Published: (2025)

Interpretable Features for the Assessment of Neurodegenerative Diseases through Handwriting Analysis
by: Thebaud, Thomas, et al.
Published: (2024)

Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs
by: Harshavardhan
Published: (2026)

Improving Steering Vectors by Targeting Sparse Autoencoder Features
by: Chalnev, Sviatoslav, et al.
Published: (2024)

SAEs Are Good for Steering -- If You Select the Right Features
by: Arad, Dana, et al.
Published: (2025)

Student sentiment Analysis Using Classification With Feature Extraction Techniques
by: Tamrakar, Latika, et al.
Published: (2021)

Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
by: Xu, Yang, et al.
Published: (2024)

Local Feature Selection without Label or Feature Leakage for Interpretable Machine Learning Predictions
by: Oosterhuis, Harrie, et al.
Published: (2024)

SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
by: Ghosh, Shaona, et al.
Published: (2025)

Explaining Concept Shift with Interpretable Feature Attribution
by: Lyu, Ruiqi, et al.
Published: (2025)

Semantic-Guided RL for Interpretable Feature Engineering
by: Bouadi, Mohamed, et al.
Published: (2024)

Gradient Boosting Mapping for Dimensionality Reduction and Feature Extraction
by: Patron, Anri, et al.
Published: (2024)

Exemplar Partitioning for Mechanistic Interpretability
by: Rumbelow, Jessica
Published: (2026)

From Mechanistic to Compositional Interpretability
by: Gauderis, Ward, et al.
Published: (2026)

Open Problems in Mechanistic Interpretability
by: Sharkey, Lee, et al.
Published: (2025)

Feature-Based Interpretable Surrogates for Optimization
by: Goerigk, Marc, et al.
Published: (2024)

When Can You Trust Your Explanations? A Robustness Analysis on Feature Importances
by: Vascotto, Ilaria, et al.
Published: (2024)

Interpreting Emergent Features in Deep Learning-based Side-channel Analysis
by: Karayalçin, Sengim, et al.
Published: (2025)

Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features
by: Cho, Seonglae, et al.
Published: (2026)