| Main Authors: | Lu, Dawn; Rimsky, Nina |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2402.00402 |
Similar Items
Steering Llama 2 via Contrastive Activation Addition
by: Panickssery, Nina, et al.
Published: (2023)
AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs
by: Fathullah, Yassir, et al.
Published: (2023)
SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs
by: Siu, Vincent, et al.
Published: (2025)
Steering Awareness: Detecting Activation Steering from Within
by: Rivera, Joshua Fonseca, et al.
Published: (2025)
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
by: Kang, Diancheng, et al.
Published: (2026)
Forbidden Facts: An Investigation of Competing Objectives in Llama-2
by: Wang, Tony T., et al.
Published: (2023)
HyperSteer: Activation Steering at Scale with Hypernetworks
by: Sun, Jiuding, et al.
Published: (2025)
TinyLlama: An Open-Source Small Language Model
by: Zhang, Peiyuan, et al.
Published: (2024)
MGH Radiology Llama: A Llama 3 70B Model for Radiology
by: Shi, Yucheng, et al.
Published: (2024)
Steering Towards Fairness: Mitigating Political Bias in LLMs
by: Nadeem, Afrozah, et al.
Published: (2025)
Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
by: Ackerman, Christopher, et al.
Published: (2024)
Steer Like the LLM: Activation Steering that Mimics Prompting
by: Heyman, Geert, et al.
Published: (2026)
Activation Scaling for Steering and Interpreting Language Models
by: Stoehr, Niklas, et al.
Published: (2024)
Fusion Steering: Prompt-Specific Activation Control
by: Chang, Waldemar, et al.
Published: (2025)
RepIt: Steering Language Models with Concept-Specific Refusal Vectors
by: Siu, Vincent, et al.
Published: (2025)
When Wording Steers the Evaluation: Framing Bias in LLM judges
by: Hwang, Yerin, et al.
Published: (2026)
Cross-Lingual Activation Steering for Multilingual Language Models
by: Pokharel, Rhitabrat, et al.
Published: (2026)
Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs
by: Nadeem, Afrozah, et al.
Published: (2026)
Programming Refusal with Conditional Activation Steering
by: Lee, Bruce W., et al.
Published: (2024)
SAKE: Steering Activations for Knowledge Editing
by: Scialanga, Marco, et al.
Published: (2025)
Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning
by: Wannan, et al.
Published: (2025)
DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion
by: Li, Yu, et al.
Published: (2024)
Open Llama2 Model for the Lithuanian Language
by: Nakvosas, Artūras, et al.
Published: (2024)
Endogenous Resistance to Activation Steering in Language Models
by: McKenzie, Alex, et al.
Published: (2026)
Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes
by: Masoudian, Shahed, et al.
Published: (2025)
On Effects of Steering Latent Representation for Large Language Model Unlearning
by: Huu-Tien, Dang, et al.
Published: (2024)
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
by: Cheng, Stephen, et al.
Published: (2026)
ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models
by: Guang, Jiahui, et al.
Published: (2026)
LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss
by: Rodriguez, Pau, et al.
Published: (2025)
Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
by: Siddique, Zara, et al.
Published: (2025)
CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering
by: Lv, Hang, et al.
Published: (2025)
Extracting Unlearned Information from LLMs with Activation Steering
by: Seyitoğlu, Atakan, et al.
Published: (2024)
The Llama 3 Herd of Models
by: Grattafiori, Aaron, et al.
Published: (2024)
Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction
by: Xu, Ziyang, et al.
Published: (2024)
EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering
by: Xu, Haolei, et al.
Published: (2025)
JetMoE: Reaching Llama2 Performance with 0.1M Dollars
by: Shen, Yikang, et al.
Published: (2024)
Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering
by: Valentino, Marco, et al.
Published: (2025)
Llama-Nemotron: Efficient Reasoning Models
by: Bercovich, Akhiad, et al.
Published: (2025)
Extending Activation Steering to Broad Skills and Multiple Behaviours
by: van der Weij, Teun, et al.
Published: (2024)
How Prevalent is Gender Bias in ChatGPT? -- Exploring German and English ChatGPT Responses
by: Urchs, Stefanie, et al.
Published: (2023)