:: Library Catalog

Bibliographic Details
Main Authors:	Han, Peixuan; Qian, Cheng; Chen, Xiusi; Zhang, Yuji; Ji, Heng; Zhang, Denghui
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2502.01042

Similar Items

EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents
by: Qian, Cheng, et al.
Published: (2024)

RM-R1: Reward Modeling as Reasoning
by: Chen, Xiusi, et al.
Published: (2025)

ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges
by: Qian, Cheng, et al.
Published: (2025)

ISACL: Internal State Analyzer for Copyrighted Training Data Leakage
by: Zhang, Guangwei, et al.
Published: (2025)

LLMGuard: Guarding Against Unsafe LLM Behavior
by: Goyal, Shubh, et al.
Published: (2024)

DecisionFlow: Advancing Large Language Model as Principled Decision Maker
by: Chen, Xiusi, et al.
Published: (2025)

Disentangling Safe and Unsafe Corruptions via Anisotropy and Locality
by: Muthukumar, Ramchandran, et al.
Published: (2025)

Current Agents Fail to Leverage World Model as Tool for Foresight
by: Qian, Cheng, et al.
Published: (2026)

CBMAS: Cognitive Behavioral Modeling via Activation Steering
by: Ismail, Ahmed H., et al.
Published: (2026)

Angular Steering: Behavior Control via Rotation in Activation Space
by: Vu, Hieu M., et al.
Published: (2025)

Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
by: Shu, Dong, et al.
Published: (2026)

Safe-Support Q-Learning: Learning without Unsafe Exploration
by: Lim, Yeeun, et al.
Published: (2026)

Attention Shift: Steering AI Away from Unsafe Content
by: Garg, Shivank, et al.
Published: (2024)

ToolRL: Reward is All Tool Learning Needs
by: Qian, Cheng, et al.
Published: (2025)

SMART: Self-Aware Agent for Tool Overuse Mitigation
by: Qian, Cheng, et al.
Published: (2025)

Steered LLM Activations are Non-Surjective
by: Mishra, Aayush, et al.
Published: (2026)

Steer Like the LLM: Activation Steering that Mimics Prompting
by: Heyman, Geert, et al.
Published: (2026)

Decoding the Critique Mechanism in Large Reasoning Models
by: Phan, Hoang, et al.
Published: (2026)

Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
by: Dinh, Mih, et al.
Published: (2026)

No Safe Dose: How Training Data Drives Unsafe Image Generation
by: Friedrich, Felix, et al.
Published: (2026)

Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation
by: Kim, Yongwoo, et al.
Published: (2026)

Activation Steering with a Feedback Controller
by: Nguyen, Dung V., et al.
Published: (2025)

SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
by: Ghosh, Shaona, et al.
Published: (2025)

The Rogue Scalpel: Activation Steering Compromises LLM Safety
by: Korznikov, Anton, et al.
Published: (2025)

VSPO: Vector-Steered Policy Optimization for Behavioral Control
by: Zhang, Xuechen, et al.
Published: (2026)

ROAST: Rollout-based On-distribution Activation Steering Technique
by: Su, Xuanbo, et al.
Published: (2026)

CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing
by: Qian, Cheng, et al.
Published: (2026)

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
by: Zhang, Dongcheng, et al.
Published: (2026)

Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention
by: Jin, Zehao, et al.
Published: (2026)

ShieldNN: A Provably Safe NN Filter for Unsafe NN Controllers
by: Ferlez, James, et al.
Published: (2020)

Command-V: Pasting LLM Behaviors via Activation Profiles
by: Wang, Barry, et al.
Published: (2025)

No Free Lunch: Rethinking Internal Feedback for LLM Reasoning
by: Zhang, Yanzhi, et al.
Published: (2025)

BarrierSteer: LLM Safety via Learning Barrier Steering
by: Tran, Thanh Q., et al.
Published: (2026)

Global Evolutionary Steering: Refining Activation Steering Control via Cross-Layer Consistency
by: Jiang, Xinyan, et al.
Published: (2026)

Word Embeddings Are Steers for Language Models
by: Han, Chi, et al.
Published: (2023)

HyperSteer: Activation Steering at Scale with Hypernetworks
by: Sun, Jiuding, et al.
Published: (2025)

DRPG (Decompose, Retrieve, Plan, Generate): An Agentic Framework for Academic Rebuttal
by: Han, Peixuan, et al.
Published: (2026)

Dynamically Scaled Activation Steering
by: Ferrando, Alex, et al.
Published: (2025)

Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks
by: Mohan, Vamshi Sunku, et al.
Published: (2026)

Self-Improving LLM Agents at Test-Time
by: Acikgoz, Emre Can, et al.
Published: (2025)