Library Catalog

Bibliographic Details
Main Authors:	Braun, Joschka; Eickhoff, Carsten; Krueger, David; Bahrainian, Seyed Ali; Krasheninnikov, Dmitrii
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2505.22637

Similar Items

Beyond Multiple Choice: Evaluating Steering Vectors for Summarization
by: Braun, Joschka, et al.
Published: (2025)

Logit Reweighting for Topic-Focused Summarization
by: Braun, Joschka, et al.
Published: (2025)

Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations
by: Braun, Joschka
Published: (2026)

Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks
by: Brumley, Madeline, et al.
Published: (2024)

Stress-Testing Capability Elicitation With Password-Locked Models
by: Greenblatt, Ryan, et al.
Published: (2024)

Implicit meta-learning may lead language models to trust more reliable sources
by: Krasheninnikov, Dmitrii, et al.
Published: (2023)

Fresh in memory: Training-order recency is linearly encoded in language model activations
by: Krasheninnikov, Dmitrii, et al.
Published: (2025)

Language Models Implement Simple Word2Vec-style Vector Arithmetic
by: Merullo, Jack, et al.
Published: (2023)

Defining and Characterizing Reward Hacking
by: Skalse, Joar, et al.
Published: (2022)

Enhancing Retrieval-Augmented Generation: A Study of Best Practices
by: Li, Siran, et al.
Published: (2025)

Analyzing the Generalization and Reliability of Steering Vectors
by: Tan, Daniel, et al.
Published: (2024)

Circuit Component Reuse Across Tasks in Transformer Language Models
by: Merullo, Jack, et al.
Published: (2023)

When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?
by: Zhou, Xinyu, et al.
Published: (2026)

Detecting High-Stakes Interactions with Activation Probes
by: McKenzie, Alex, et al.
Published: (2025)

Understanding Reasoning in Thinking Language Models via Steering Vectors
by: Venhoff, Constantin, et al.
Published: (2025)

Stable Anisotropic Regularization
by: Rudman, William, et al.
Published: (2023)

Fill in the Blanks: Accelerating Q-Learning with a Handful of Demonstrations in Sparse Reward Settings
by: Azad, Seyed Mahdi Basiri, et al.
Published: (2025)

Benchmarking is Broken -- Don't Let AI be its Own Judge
by: Cheng, Zerui, et al.
Published: (2025)

PiCME: Pipeline for Contrastive Modality Evaluation and Encoding in the MIMIC Dataset
by: Golovanevsky, Michal, et al.
Published: (2025)

On the Non-Identifiability of Steering Vectors in Large Language Models
by: Venkatesh, Sohan, et al.
Published: (2026)

Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
by: Gan, Woody Haosheng, et al.
Published: (2025)

Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts
by: Golovanevsky, Michal, et al.
Published: (2025)

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
by: Cao, Yuanpu, et al.
Published: (2024)

SR-Reward: Taking The Path More Traveled
by: Azad, Seyed Mahdi B., et al.
Published: (2025)

TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning
by: Beck, Florentin, et al.
Published: (2025)

White-Box Sensitivity Auditing with Steering Vectors
by: Cyberey, Hannah, et al.
Published: (2026)

Beyond One-Time Validation: A Framework for Adaptive Validation of Prognostic and Diagnostic AI-based Medical Devices
by: Hellmeier, Florian, et al.
Published: (2024)

One-Versus-Others Attention: Scalable Multimodal Integration for Biomedical Data
by: Golovanevsky, Michal, et al.
Published: (2023)

Explainable AI (XAI) for Arrhythmia detection from electrocardiograms
by: Beck, Joschka, et al.
Published: (2025)

Predicting Where Steering Vectors Succeed
by: Billa, Jayadev
Published: (2026)

Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions
by: Bao, Yuntai, et al.
Published: (2026)

Does TabPFN Understand Causal Structures?
by: Swelam, Omar, et al.
Published: (2025)

From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures
by: Rottach, Florian, et al.
Published: (2025)

The Impact of Off-Policy Training Data on Probe Generalisation
by: Kirch, Nathalie, et al.
Published: (2025)

Steering Large Language Model Activations in Sparse Spaces
by: Bayat, Reza, et al.
Published: (2025)

Understanding In-Context Learning of Linear Models in Transformers Through an Adversarial Lens
by: Anwar, Usman, et al.
Published: (2024)

Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention
by: Jin, Zehao, et al.
Published: (2026)

Steering Language Model Refusal with Sparse Autoencoders
by: O'Brien, Kyle, et al.
Published: (2024)

Probabilistic Recurrent Intention Switching Model
by: Sheng, Wenyuan, et al.
Published: (2026)

Steering Language Models With Activation Engineering
by: Turner, Alexander Matt, et al.
Published: (2023)