| Main Author: | Xie, Jiaqing |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.01246 |
Similar Items
Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering
by: Zhao, Haiyan, et al.
Published: (2025)
SteerRM: Debiasing Reward Models via Sparse Autoencoders
by: Sun, Mengyuan, et al.
Published: (2026)
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
by: He, Zirui, et al.
Published: (2025)
SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs
by: Härle, Ruben, et al.
Published: (2024)
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
by: Wu, Xuansheng, et al.
Published: (2025)
Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection
by: Ghussin, Yusser Al, et al.
Published: (2026)
Improving Steering Vectors by Targeting Sparse Autoencoder Features
by: Chalnev, Sviatoslav, et al.
Published: (2024)
Controllable LLM Reasoning via Sparse Autoencoder-Based Steering
by: Fang, Yi, et al.
Published: (2026)
Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation
by: Hua, Zhenglin, et al.
Published: (2025)
Activation Steering for Masked Diffusion Language Models
by: Shnaidman, Adi, et al.
Published: (2025)
Contextual Linear Activation Steering of Language Models
by: Hsu, Brandon, et al.
Published: (2026)
Steering Language Models With Activation Engineering
by: Turner, Alexander Matt, et al.
Published: (2023)
Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations
by: Joshi, Shruti, et al.
Published: (2025)
Understanding Refusal in Language Models with Sparse Autoencoders
by: Yeo, Wei Jie, et al.
Published: (2025)
SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders
by: Yu, Zhuohao, et al.
Published: (2025)
REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering
by: Zhan, Li-Ming, et al.
Published: (2025)
Activation Scaling for Steering and Interpreting Language Models
by: Stoehr, Niklas, et al.
Published: (2024)
Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines
by: Jørgensen, Mikkel Godsk, et al.
Published: (2026)
LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models
by: Yang, Jingyuan, et al.
Published: (2025)
Controlling Large Language Model Agents with Entropic Activation Steering
by: Rahn, Nate, et al.
Published: (2024)
SAFER: Probing Safety in Reward Models with Sparse Autoencoder
by: Shi, Wei, et al.
Published: (2025)
Cross-Lingual Activation Steering for Multilingual Language Models
by: Pokharel, Rhitabrat, et al.
Published: (2026)
Endogenous Resistance to Activation Steering in Language Models
by: McKenzie, Alex, et al.
Published: (2026)
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
by: Farnik, Lucy, et al.
Published: (2025)
CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features
by: Cho, Seonglae, et al.
Published: (2025)
Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
by: Deng, Boyi, et al.
Published: (2025)
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
by: Zhang, Ruikang, et al.
Published: (2026)
Spherical Steering: Geometry-Aware Activation Rotation for Language Models
by: You, Zejia, et al.
Published: (2026)
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
by: Karvonen, Adam, et al.
Published: (2025)
SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder
by: Liu, Dengcan, et al.
Published: (2025)
Enhancing Cross-task Transfer of Large Language Models via Activation Steering
by: Tang, Xinyu, et al.
Published: (2025)
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
by: Wu, Zhengxuan, et al.
Published: (2025)
SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
by: He, Zirui, et al.
Published: (2025)
Achieving Sparse Activation in Small Language Models
by: Song, Jifeng, et al.
Published: (2024)
Improving Instruction-Following in Language Models through Activation Steering
by: Stolfo, Alessandro, et al.
Published: (2024)
Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models
by: Yamashita, Tomoya, et al.
Published: (2025)
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
by: O'Neill, Charles, et al.
Published: (2024)
Fine-Grained Activation Steering: Steering Less, Achieving More
by: Feng, Zijian, et al.
Published: (2026)
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
by: Shu, Dong, et al.
Published: (2025)
Do Large Language Models Truly Understand Cross-cultural Differences?
by: Guo, Shiwei, et al.
Published: (2025)