| Main Authors: | Braun, Joschka; Eickhoff, Carsten; Krueger, David; Bahrainian, Seyed Ali; Krasheninnikov, Dmitrii |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.22637 |
Similar Items
Beyond Multiple Choice: Evaluating Steering Vectors for Summarization
by: Braun, Joschka, et al.
Published: (2025)
Logit Reweighting for Topic-Focused Summarization
by: Braun, Joschka, et al.
Published: (2025)
Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations
by: Braun, Joschka
Published: (2026)
Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks
by: Brumley, Madeline, et al.
Published: (2024)
Stress-Testing Capability Elicitation With Password-Locked Models
by: Greenblatt, Ryan, et al.
Published: (2024)
Implicit meta-learning may lead language models to trust more reliable sources
by: Krasheninnikov, Dmitrii, et al.
Published: (2023)
Fresh in memory: Training-order recency is linearly encoded in language model activations
by: Krasheninnikov, Dmitrii, et al.
Published: (2025)
Language Models Implement Simple Word2Vec-style Vector Arithmetic
by: Merullo, Jack, et al.
Published: (2023)
Defining and Characterizing Reward Hacking
by: Skalse, Joar, et al.
Published: (2022)
Enhancing Retrieval-Augmented Generation: A Study of Best Practices
by: Li, Siran, et al.
Published: (2025)
Analyzing the Generalization and Reliability of Steering Vectors
by: Tan, Daniel, et al.
Published: (2024)
Circuit Component Reuse Across Tasks in Transformer Language Models
by: Merullo, Jack, et al.
Published: (2023)
When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?
by: Zhou, Xinyu, et al.
Published: (2026)
Detecting High-Stakes Interactions with Activation Probes
by: McKenzie, Alex, et al.
Published: (2025)
Understanding Reasoning in Thinking Language Models via Steering Vectors
by: Venhoff, Constantin, et al.
Published: (2025)
Stable Anisotropic Regularization
by: Rudman, William, et al.
Published: (2023)
Fill in the Blanks: Accelerating Q-Learning with a Handful of Demonstrations in Sparse Reward Settings
by: Azad, Seyed Mahdi Basiri, et al.
Published: (2025)
Benchmarking is Broken -- Don't Let AI be its Own Judge
by: Cheng, Zerui, et al.
Published: (2025)
PiCME: Pipeline for Contrastive Modality Evaluation and Encoding in the MIMIC Dataset
by: Golovanevsky, Michal, et al.
Published: (2025)
On the Non-Identifiability of Steering Vectors in Large Language Models
by: Venkatesh, Sohan, et al.
Published: (2026)
Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
by: Gan, Woody Haosheng, et al.
Published: (2025)
Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts
by: Golovanevsky, Michal, et al.
Published: (2025)
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
by: Cao, Yuanpu, et al.
Published: (2024)
SR-Reward: Taking The Path More Traveled
by: Azad, Seyed Mahdi B., et al.
Published: (2025)
TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning
by: Beck, Florentin, et al.
Published: (2025)
White-Box Sensitivity Auditing with Steering Vectors
by: Cyberey, Hannah, et al.
Published: (2026)
Beyond One-Time Validation: A Framework for Adaptive Validation of Prognostic and Diagnostic AI-based Medical Devices
by: Hellmeier, Florian, et al.
Published: (2024)
One-Versus-Others Attention: Scalable Multimodal Integration for Biomedical Data
by: Golovanevsky, Michal, et al.
Published: (2023)
Explainable AI (XAI) for Arrhythmia detection from electrocardiograms
by: Beck, Joschka, et al.
Published: (2025)
Predicting Where Steering Vectors Succeed
by: Billa, Jayadev
Published: (2026)
Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions
by: Bao, Yuntai, et al.
Published: (2026)
Does TabPFN Understand Causal Structures?
by: Swelam, Omar, et al.
Published: (2025)
From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures
by: Rottach, Florian, et al.
Published: (2025)
The Impact of Off-Policy Training Data on Probe Generalisation
by: Kirch, Nathalie, et al.
Published: (2025)
Steering Large Language Model Activations in Sparse Spaces
by: Bayat, Reza, et al.
Published: (2025)
Understanding In-Context Learning of Linear Models in Transformers Through an Adversarial Lens
by: Anwar, Usman, et al.
Published: (2024)
Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention
by: Jin, Zehao, et al.
Published: (2026)
Steering Language Model Refusal with Sparse Autoencoders
by: O'Brien, Kyle, et al.
Published: (2024)
Probabilistic Recurrent Intention Switching Model
by: Sheng, Wenyuan, et al.
Published: (2026)
Steering Language Models With Activation Engineering
by: Turner, Alexander Matt, et al.
Published: (2023)