| Main Authors: | Han, Peixuan, Qian, Cheng, Chen, Xiusi, Zhang, Yuji, Ji, Heng, Zhang, Denghui |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.01042 |
Similar Items
EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents
by: Qian, Cheng, et al.
Published: (2024)
RM-R1: Reward Modeling as Reasoning
by: Chen, Xiusi, et al.
Published: (2025)
ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges
by: Qian, Cheng, et al.
Published: (2025)
ISACL: Internal State Analyzer for Copyrighted Training Data Leakage
by: Zhang, Guangwei, et al.
Published: (2025)
LLMGuard: Guarding Against Unsafe LLM Behavior
by: Goyal, Shubh, et al.
Published: (2024)
DecisionFlow: Advancing Large Language Model as Principled Decision Maker
by: Chen, Xiusi, et al.
Published: (2025)
Disentangling Safe and Unsafe Corruptions via Anisotropy and Locality
by: Muthukumar, Ramchandran, et al.
Published: (2025)
Current Agents Fail to Leverage World Model as Tool for Foresight
by: Qian, Cheng, et al.
Published: (2026)
CBMAS: Cognitive Behavioral Modeling via Activation Steering
by: Ismail, Ahmed H., et al.
Published: (2026)
Angular Steering: Behavior Control via Rotation in Activation Space
by: Vu, Hieu M., et al.
Published: (2025)
Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
by: Shu, Dong, et al.
Published: (2026)
Safe-Support Q-Learning: Learning without Unsafe Exploration
by: Lim, Yeeun, et al.
Published: (2026)
Attention Shift: Steering AI Away from Unsafe Content
by: Garg, Shivank, et al.
Published: (2024)
ToolRL: Reward is All Tool Learning Needs
by: Qian, Cheng, et al.
Published: (2025)
SMART: Self-Aware Agent for Tool Overuse Mitigation
by: Qian, Cheng, et al.
Published: (2025)
Steered LLM Activations are Non-Surjective
by: Mishra, Aayush, et al.
Published: (2026)
Steer Like the LLM: Activation Steering that Mimics Prompting
by: Heyman, Geert, et al.
Published: (2026)
Decoding the Critique Mechanism in Large Reasoning Models
by: Phan, Hoang, et al.
Published: (2026)
Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
by: Dinh, Mih, et al.
Published: (2026)
No Safe Dose: How Training Data Drives Unsafe Image Generation
by: Friedrich, Felix, et al.
Published: (2026)
Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation
by: Kim, Yongwoo, et al.
Published: (2026)
Activation Steering with a Feedback Controller
by: Nguyen, Dung V., et al.
Published: (2025)
SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
by: Ghosh, Shaona, et al.
Published: (2025)
The Rogue Scalpel: Activation Steering Compromises LLM Safety
by: Korznikov, Anton, et al.
Published: (2025)
VSPO: Vector-Steered Policy Optimization for Behavioral Control
by: Zhang, Xuechen, et al.
Published: (2026)
ROAST: Rollout-based On-distribution Activation Steering Technique
by: Su, Xuanbo, et al.
Published: (2026)
CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing
by: Qian, Cheng, et al.
Published: (2026)
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
by: Zhang, Dongcheng, et al.
Published: (2026)
Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention
by: Jin, Zehao, et al.
Published: (2026)
ShieldNN: A Provably Safe NN Filter for Unsafe NN Controllers
by: Ferlez, James, et al.
Published: (2020)
Command-V: Pasting LLM Behaviors via Activation Profiles
by: Wang, Barry, et al.
Published: (2025)
No Free Lunch: Rethinking Internal Feedback for LLM Reasoning
by: Zhang, Yanzhi, et al.
Published: (2025)
BarrierSteer: LLM Safety via Learning Barrier Steering
by: Tran, Thanh Q., et al.
Published: (2026)
Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency
by: Jiang, Xinyan, et al.
Published: (2026)
Word Embeddings Are Steers for Language Models
by: Han, Chi, et al.
Published: (2023)
HyperSteer: Activation Steering at Scale with Hypernetworks
by: Sun, Jiuding, et al.
Published: (2025)
DRPG (Decompose, Retrieve, Plan, Generate): An Agentic Framework for Academic Rebuttal
by: Han, Peixuan, et al.
Published: (2026)
Dynamically Scaled Activation Steering
by: Ferrando, Alex, et al.
Published: (2025)
Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks
by: Mohan, Vamshi Sunku, et al.
Published: (2026)
Self-Improving LLM Agents at Test-Time
by: Acikgoz, Emre Can, et al.
Published: (2025)