| Main Authors: | Xue, Zhiyu, Qi, Zimo, Liu, Guangliang, Chen, Bocheng, Pedarsani, Ramtin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2603.11388 |
Similar Items
No Free Lunch for Defending Against Prefilling Attack by In-Context Learning
by: Xue, Zhiyu, et al.
Published: (2024)
Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations
by: Xue, Zhiyu, et al.
Published: (2025)
Conflict-Aware Adversarial Training
by: Xue, Zhiyu, et al.
Published: (2024)
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
by: Pan, Licheng, et al.
Published: (2025)
Inverse Reinforcement Learning by Estimating Expertise of Demonstrators
by: Beliaev, Mark, et al.
Published: (2024)
Think Before Refusal : Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior
by: Si, Shengyun, et al.
Published: (2025)
LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment
by: Zhang, Haonan, et al.
Published: (2026)
Altruistic Maneuver Planning for Cooperative Autonomous Vehicles Using Multi-agent Advantage Actor-Critic
by: Toghi, Behrad, et al.
Published: (2021)
Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders
by: Campbell, David, et al.
Published: (2026)
Learning to Refuse: Towards Mitigating Privacy Risks in LLMs
by: Liu, Zhenhua, et al.
Published: (2024)
Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?
by: Yin, Qingyu, et al.
Published: (2025)
Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models
by: Duan, Ranjie, et al.
Published: (2025)
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
by: Chae, Kyubyung, et al.
Published: (2025)
Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off
by: Chen, Yu, et al.
Published: (2026)
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
by: Xie, Tinghao, et al.
Published: (2024)
Context-Aware Counterfactual Data Augmentation for Gender Bias Mitigation in Language Models
by: Parihar, Shweta, et al.
Published: (2026)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
by: Yuan, Youliang, et al.
Published: (2024)
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning
by: Zhang, Zhehao, et al.
Published: (2025)
SPEX: Scaling Feature Interaction Explanations for LLMs
by: Kang, Justin Singh, et al.
Published: (2025)
From Refusal Tokens to Refusal Control: Discovering and Steering Category-Specific Refusal Directions
by: Alagharu, Rishab, et al.
Published: (2026)
ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal
by: Zhang, Haonan, et al.
Published: (2025)
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
by: Liu, Qin, et al.
Published: (2024)
The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs
by: Deng, Yonghong, et al.
Published: (2026)
The Safety-Privacy Tradeoff in Linear Bandits
by: Zibaie, Arghavan, et al.
Published: (2025)
The impact of multi-agent debate protocols on debate quality: a controlled case study
by: Marandi, Ramtin Zargari
Published: (2026)
When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals
by: Anonto, Riad Ahmed, et al.
Published: (2025)
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
by: Chen, Jianhui, et al.
Published: (2024)
LoRA is All You Need for Safety Alignment of Reasoning LLMs
by: Xue, Yihao, et al.
Published: (2025)
LatentRefusal: Latent-Signal Refusal for Unanswerable Text-to-SQL Queries
by: Ren, Xuancheng, et al.
Published: (2026)
EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions
by: Wu, Xiaorui, et al.
Published: (2025)
AlignKT: Explicitly Modeling Knowledge State for Knowledge Tracing with Ideal State Alignment
by: Xiao, Jing, et al.
Published: (2025)
MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems
by: Chen, Kai, et al.
Published: (2025)
Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning
by: Dabas, Mahavir, et al.
Published: (2025)
Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability
by: Chhabra, Vishnu Kabir, et al.
Published: (2025)
The Refusal--Compliance Tradeoff: A Large-Scale Safety Behavior Audit of Large Language Models
by: Hasan, Alif Al, et al.
Published: (2026)
Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context
by: Zhang, Zhihao, et al.
Published: (2026)
Communication-Efficient and Tensorized Federated Fine-Tuning of Large Language Models
by: Ghiasvand, Sajjad, et al.
Published: (2024)
MoralReason: Generalizable Moral Decision Alignment For LLM Agents Using Reasoning-Level Reinforcement Learning
by: An, Zhiyu, et al.
Published: (2025)
TreePrompt: Leveraging Hierarchical Few-Shot Example Selection for Improved English-Persian and English-German Translation
by: Kakavand, Ramtin, et al.
Published: (2025)
Open Problems in Differentiable Social Choice: Learning Mechanisms, Decisions, and Alignment
by: An, Zhiyu, et al.
Published: (2026)