| Main Authors: | Liu, Shunchang, Chen, Xin, Urcelay, Belen Martin, Croce, Francesco |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.16339 |
Similar Items
Interpretable Reward Model via Sparse Autoencoder
by: Zhang, Shuyi, et al.
Published: (2025)
Feature Starvation as Geometric Instability in Sparse Autoencoders
by: Chaudhry, Faris, et al.
Published: (2026)
Toward Identifiable Sparse Autoencoders
by: Nelson, Walter, et al.
Published: (2026)
Beyond Labels: Information-Efficient Human-in-the-Loop Learning using Ranking and Selection Queries
by: Martín-Urcelay, Belén, et al.
Published: (2026)
Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation
by: Hua, Zhenglin, et al.
Published: (2025)
Sparse Autoencoders are Capable LLM Jailbreak Mitigators
by: Assogba, Yannick, et al.
Published: (2026)
Multimodal Variational Autoencoder for Low-cost Cardiac Hemodynamics Instability Detection
by: Suvon, Mohammod N. I., et al.
Published: (2024)
Low-Rank Adapting Models for Sparse Autoencoders
by: Chen, Matthew, et al.
Published: (2025)
Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
by: Miao, Yuchun, et al.
Published: (2025)
Exploring and Addressing Reward Confusion in Offline Preference Learning
by: Chen, Xin, et al.
Published: (2024)
Steering Language Model Refusal with Sparse Autoencoders
by: O'Brien, Kyle, et al.
Published: (2024)
Selective Induction Heads: How Transformers Select Causal Structures In Context
by: D'Angelo, Francesco, et al.
Published: (2025)
Attribution-Guided Distillation of Matryoshka Sparse Autoencoders
by: Martin-Linares, Cristina P., et al.
Published: (2025)
Preference-Guided Learning for Sparse-Reward Multi-Agent Reinforcement Learning
by: Bui, The Viet, et al.
Published: (2025)
Ensembling Sparse Autoencoders
by: Gadgil, Soham, et al.
Published: (2025)
Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking
by: Beigi, Mohammad, et al.
Published: (2026)
Do Sparse Autoencoders Identify Reasoning Features in Language Models?
by: Ma, George, et al.
Published: (2026)
Model Unlearning via Sparse Autoencoder Subspace Guided Projections
by: Wang, Xu, et al.
Published: (2025)
Generalizing Reward Modeling for Out-of-Distribution Preference Learning
by: Jia, Chen
Published: (2024)
Training Superior Sparse Autoencoders for Instruct Models
by: Li, Jiaming, et al.
Published: (2025)
Sparse Autoencoders are Topic Models
by: Girrbach, Leander, et al.
Published: (2025)
Towards Reliable Evaluation and Fast Training of Robust Semantic Segmentation Models
by: Croce, Francesco, et al.
Published: (2023)
Analysis of Variational Sparse Autoencoders
by: Baker, Zachary, et al.
Published: (2025)
InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
by: Miao, Yuchun, et al.
Published: (2024)
Sparse Autoencoders, Again?
by: Lu, Yin, et al.
Published: (2025)
FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
by: Schlarmann, Christian, et al.
Published: (2025)
Route Sparse Autoencoder to Interpret Large Language Models
by: Shi, Wei, et al.
Published: (2025)
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
by: Yu, Xin, et al.
Published: (2026)
PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model
by: Lin, Baijiong, et al.
Published: (2025)
Learning Retrieval Models with Sparse Autoencoders
by: Formal, Thibault, et al.
Published: (2026)
Reward Model Ensembles Help Mitigate Overoptimization
by: Coste, Thomas, et al.
Published: (2023)
Improving Robustness In Sparse Autoencoders via Masked Regularization
by: Narayanaswamy, Vivek, et al.
Published: (2026)
Reward Learning From Preference With Ties
by: Liu, Jinsong, et al.
Published: (2024)
Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
by: Wang, Haoxiang, et al.
Published: (2024)
Are Sparse Autoencoders Useful for Java Function Bug Detection?
by: Melo, Rui, et al.
Published: (2025)
SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling
by: Lou, Xingzhou, et al.
Published: (2024)
PILAF: Optimal Human Preference Sampling for Reward Modeling
by: Feng, Yunzhen, et al.
Published: (2025)
Explicit Preference Optimization: No Need for an Implicit Reward Model
by: Hu, Xiangkun, et al.
Published: (2025)
Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction
by: Song, Ruike, et al.
Published: (2025)
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
by: Eisenstein, Jacob, et al.
Published: (2023)