| Main Authors: | Berg, Cameron; Lulla, Roshni |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.09773 |
Similar Items
"Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior
by: Lulla, Roshni, et al.
Published: (2026)
Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings
by: Das, Nilanjana, et al.
Published: (2026)
To Tell The Truth: Language of Deception and Language Models
by: Hazra, Sanchaita, et al.
Published: (2023)
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
by: Zhou, Hanhan, et al.
Published: (2026)
Self-Steering Language Models
by: Grand, Gabriel, et al.
Published: (2025)
An Assessment of Model-On-Model Deception
by: Heitkoetter, Julius, et al.
Published: (2024)
Causal Language Control in Multilingual Transformers via Sparse Feature Steering
by: Chou, Cheng-Ting, et al.
Published: (2025)
Steering When Necessary: Flexible Steering Large Language Models with Backtracking
by: Cheng, Zifeng, et al.
Published: (2025)
Interpretable Steering of Large Language Models with Feature Guided Activation Additions
by: Soo, Samuel, et al.
Published: (2025)
Seamless Deception: Larger Language Models Are Better Knowledge Concealers
by: Ashok, Dhananjay, et al.
Published: (2026)
On the Limitations of Steering in Language Model Alignment
by: Niranjan, Chebrolu, et al.
Published: (2025)
Deception Abilities Emerged in Large Language Models
by: Hagendorff, Thilo
Published: (2023)
That's Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation
by: Bae, Jaesung, et al.
Published: (2025)
The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning
by: Merrill, Scott, et al.
Published: (2026)
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
by: Lermen, Simon, et al.
Published: (2025)
Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language Models
by: Guo, Linge
Published: (2024)
Compromising Honesty and Harmlessness in Language Models via Deception Attacks
by: Vaugrante, Laurène, et al.
Published: (2025)
Sparse Feature Coactivation Reveals Causal Semantic Modules in Large Language Models
by: Deng, Ruixuan, et al.
Published: (2025)
Compositional Steering of Large Language Models with Steering Tokens
by: Radevski, Gorjan, et al.
Published: (2026)
Disentangling Exploration of Large Language Models by Optimal Exploitation
by: Grams, Tim, et al.
Published: (2025)
Probing and Steering Evaluation Awareness of Language Models
by: Nguyen, Jord, et al.
Published: (2025)
Activation Scaling for Steering and Interpreting Language Models
by: Stoehr, Niklas, et al.
Published: (2024)
Detecting Deceptive Dark Patterns in E-commerce Platforms
by: Ramteke, Arya, et al.
Published: (2024)
From Deception to Detection: The Dual Roles of Large Language Models in Fake News
by: Sallami, Dorsaf, et al.
Published: (2024)
Too Big to Fool: Resisting Deception in Language Models
by: Samsami, Mohammad Reza, et al.
Published: (2024)
CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models
by: Wang, Xintong, et al.
Published: (2024)
Revealing Algorithmic Deductive Circuits for Logical Reasoning
by: Nguyen, Phuong Minh, et al.
Published: (2026)
The Subject of Emergent Misalignment in Superintelligence: An Anthropological, Cognitive Neuropsychological, Machine-Learning, and Ontological Perspective
by: Imran, Muhammad Osama, et al.
Published: (2025)
Cross-Lingual Activation Steering for Multilingual Language Models
by: Pokharel, Rhitabrat, et al.
Published: (2026)
Steering Large Language Models to Evaluate and Amplify Creativity
by: Olson, Matthew Lyle, et al.
Published: (2024)
Prompt-Based Value Steering of Large Language Models
by: Abbo, Giulio Antonio, et al.
Published: (2025)
LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models
by: Olson, Matthew Lyle, et al.
Published: (2026)
Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers
by: Dumas, Clément, et al.
Published: (2024)
Exploring the Personality Traits of LLMs through Latent Features Steering
by: Yang, Shu, et al.
Published: (2024)
Steering Language Models Before They Speak: Logit-Level Interventions
by: An, Hyeseon, et al.
Published: (2026)
DLM-SWAI: Steering Diffusion Language Models Before They Unmask
by: An, Hyeseon, et al.
Published: (2026)
Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering
by: Chen, Zixin, et al.
Published: (2025)
Word Embeddings Are Steers for Language Models
by: Han, Chi, et al.
Published: (2023)
Steering Evaluation-Aware Language Models to Act Like They Are Deployed
by: Hua, Tim Tian, et al.
Published: (2025)
RepIt: Steering Language Models with Concept-Specific Refusal Vectors
by: Siu, Vincent, et al.
Published: (2025)