| Main Authors: | Zhao, Jiachen, Huang, Jing, Wu, Zhengxuan, Bau, David, Shi, Weiyan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.11878 |
Similar Items
Should LLM Safety Be More Than Refusing Harmful Instructions?
by: Maskey, Utsav, et al.
Published: (2025)
Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs
by: Yuan, Shuzhou, et al.
Published: (2025)
Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding
by: Joo, Seongho, et al.
Published: (2025)
Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
by: An, Bang, et al.
Published: (2024)
Refusal in LLMs is an Affine Function
by: Marshall, Thomas, et al.
Published: (2024)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
by: Yuan, Youliang, et al.
Published: (2024)
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
by: Huang, Jing, et al.
Published: (2024)
Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
by: Maskey, Utsav, et al.
Published: (2026)
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
by: Mazeika, Mantas, et al.
Published: (2024)
Think Before Refusal : Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior
by: Si, Shengyun, et al.
Published: (2025)
RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
by: Nguyen, Tuan T., et al.
Published: (2025)
DataPuzzle: Breaking Free from the Hallucinated Promise of LLMs in Data Analysis
by: Zhang, Zhengxuan, et al.
Published: (2025)
Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
by: Pan, Wenbo, et al.
Published: (2025)
Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies
by: Wu, Zhengxuan, et al.
Published: (2022)
Can Editing LLMs Inject Harm?
by: Chen, Canyu, et al.
Published: (2024)
Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment
by: Li, Jie, et al.
Published: (2024)
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
by: Feucht, Sheridan, et al.
Published: (2024)
Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare
by: Ahsan, Hiba, et al.
Published: (2025)
Silenced Biases: The Dark Side LLMs Learned to Refuse
by: Himelstein, Rom, et al.
Published: (2025)
Learning to Refuse: Towards Mitigating Privacy Risks in LLMs
by: Liu, Zhenhua, et al.
Published: (2024)
ReCOGS: How Incidental Details of a Logical Form Overshadow an Evaluation of Semantic Interpretation
by: Wu, Zhengxuan, et al.
Published: (2023)
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
by: Zeng, Yi, et al.
Published: (2024)
Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
by: von Recum, Alexander, et al.
Published: (2024)
Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks
by: Ham, Seokil, et al.
Published: (2025)
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
by: Wu, Zhengxuan, et al.
Published: (2025)
Do Prevalent Bias Metrics Capture Allocational Harms from LLMs?
by: Cyberey, Hannah, et al.
Published: (2024)
Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
by: García-Ferrero, Iker, et al.
Published: (2025)
Bridging Cognition and Emotion: Empathy-Driven Multimodal Misinformation Detection
by: Wang, Zihan, et al.
Published: (2025)
Do Reasoning LLMs Refuse What They Infer in Long Contexts?
by: Fu, Yu, et al.
Published: (2026)
Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
by: Atil, Berk, et al.
Published: (2025)
Does Refusal Training in LLMs Generalize to the Past Tense?
by: Andriushchenko, Maksym, et al.
Published: (2024)
Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
by: Song, Maojia, et al.
Published: (2024)
Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
by: Choi, Sooyung, et al.
Published: (2025)
PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm
by: Li, Jing-Jing, et al.
Published: (2026)
Safety Alignment of Large Language Models via Contrasting Safe and Harmful Distributions
by: Zhang, Xiaoyun, et al.
Published: (2024)
Vector Arithmetic in Concept and Token Subspaces
by: Feucht, Sheridan, et al.
Published: (2025)
Don't Say No: Jailbreaking LLM by Suppressing Refusal
by: Zhou, Yukai, et al.
Published: (2024)
Locating and Editing Factual Associations in Mamba
by: Sharma, Arnab Sen, et al.
Published: (2024)
SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering
by: Maskey, Utsav, et al.
Published: (2025)
Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
by: Zhang, Chiyu, et al.
Published: (2025)