| Main Authors: | Shu, Huizhen, Li, Xuying, Li, Zhuo |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.19839 |
Similar Items
The Resurgence of GCG Adversarial Attacks on Large Language Models
by: Tan, Yuting, et al.
Published: (2025)
Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
by: Shu, Huizhen, et al.
Published: (2025)
LatentRefusal: Latent-Signal Refusal for Unanswerable Text-to-SQL Queries
by: Ren, Xuancheng, et al.
Published: (2026)
Latent-space Attacks for Refusal Evasion in Language Models
by: Piras, Giorgio, et al.
Published: (2026)
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
by: Hu, Xulin, et al.
Published: (2026)
From Refusal Tokens to Refusal Control: Discovering and Steering Category-Specific Refusal Directions
by: Alagharu, Rishab, et al.
Published: (2026)
Exploring the Personality Traits of LLMs through Latent Features Steering
by: Yang, Shu, et al.
Published: (2024)
Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation
by: Li, Xuying, et al.
Published: (2024)
Unveiling and Steering Connectome Organization with Interpretable Latent Variables
by: Li, Yubin, et al.
Published: (2025)
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
by: García-Ferrero, Iker, et al.
Published: (2025)
Steer LLM Latents for Hallucination Detection
by: Park, Seongheon, et al.
Published: (2025)
Latent Guard: a Safety Framework for Text-to-image Generation
by: Liu, Runtao, et al.
Published: (2024)
Programming Refusal with Conditional Activation Steering
by: Lee, Bruce W., et al.
Published: (2024)
Controllable Mathematical Reasoning via Self-Optimizing Thought Vectors
by: Li, Xuying
Published: (2025)
Output Length Effect on DeepSeek-R1's Safety in Forced Thinking
by: Li, Xuying, et al.
Published: (2025)
Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts
by: Bhargav, Samaksh, et al.
Published: (2025)
RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering
by: Ye, Wencheng, et al.
Published: (2026)
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
by: Sheng, Leheng, et al.
Published: (2025)
Latent Space Disentanglement via Activation Steering for Interpretable Attribute Control in Symbolic Music Generation
by: Prokopiou, Ioannis, et al.
Published: (2026)
Transferable Latent-to-Latent Locomotion Policy for Efficient and Versatile Motion Control of Diverse Legged Robots
by: Zheng, Ziang, et al.
Published: (2025)
Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism
by: Cao, Lang
Published: (2023)
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
by: Cheng, Stephen, et al.
Published: (2026)
Spatial-Aware Latent Initialization for Controllable Image Generation
by: Sun, Wenqiang, et al.
Published: (2024)
Preemptive Detection and Steering of LLM Misalignment via Latent Reachability
by: Karnik, Sathwik, et al.
Published: (2025)
On Effects of Steering Latent Representation for Large Language Model Unlearning
by: Huu-Tien, Dang, et al.
Published: (2024)
Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs
by: Li, Jiakang, et al.
Published: (2026)
RepIt: Steering Language Models with Concept-Specific Refusal Vectors
by: Siu, Vincent, et al.
Published: (2025)
Latent Action Control for Reasoning-Guided Unified Image Generation
by: Zhai, Fuxiang, et al.
Published: (2026)
Controllable and Stealthy Shilling Attacks via Dispersive Latent Diffusion
by: Qiao, Shutong, et al.
Published: (2025)
Precision Knowledge Editing: Enhancing Safety in Large Language Models
by: Li, Xuying, et al.
Published: (2024)
Latent Policy Steering with Embodiment-Agnostic Pretrained World Models
by: Wang, Yiqi, et al.
Published: (2025)
Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
by: Yang, Kia-Jüng, et al.
Published: (2026)
Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models
by: Yang, Jiaxi, et al.
Published: (2026)
Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization
by: Egbuna, Nathan, et al.
Published: (2025)
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
by: Liu, Andy Zeyi, et al.
Published: (2026)
Learning Latent Dynamic Robust Representations for World Models
by: Sun, Ruixiang, et al.
Published: (2024)
FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding
by: Yang, Jinghan, et al.
Published: (2026)
In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
by: Liu, Sheng, et al.
Published: (2023)
Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation
by: Zhang, Wentao, et al.
Published: (2026)
Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis
by: Qiu, Kai, et al.
Published: (2025)