Bibliographic Details
Main Authors:	Zhao, Jiachen; Huang, Jing; Wu, Zhengxuan; Bau, David; Shi, Weiyan
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2507.11878

Similar Items

Should LLM Safety Be More Than Refusing Harmful Instructions?
by: Maskey, Utsav, et al.
Published: (2025)

Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs
by: Yuan, Shuzhou, et al.
Published: (2025)

Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding
by: Joo, Seongho, et al.
Published: (2025)

Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
by: An, Bang, et al.
Published: (2024)

Refusal in LLMs is an Affine Function
by: Marshall, Thomas, et al.
Published: (2024)

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
by: Yuan, Youliang, et al.
Published: (2024)

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
by: Huang, Jing, et al.
Published: (2024)

Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
by: Maskey, Utsav, et al.
Published: (2026)

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
by: Mazeika, Mantas, et al.
Published: (2024)

Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior
by: Si, Shengyun, et al.
Published: (2025)

RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
by: Nguyen, Tuan T., et al.
Published: (2025)

DataPuzzle: Breaking Free from the Hallucinated Promise of LLMs in Data Analysis
by: Zhang, Zhengxuan, et al.
Published: (2025)

Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
by: Pan, Wenbo, et al.
Published: (2025)

Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies
by: Wu, Zhengxuan, et al.
Published: (2022)

Can Editing LLMs Inject Harm?
by: Chen, Canyu, et al.
Published: (2024)

Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment
by: Li, Jie, et al.
Published: (2024)

Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
by: Feucht, Sheridan, et al.
Published: (2024)

Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare
by: Ahsan, Hiba, et al.
Published: (2025)

Silenced Biases: The Dark Side LLMs Learned to Refuse
by: Himelstein, Rom, et al.
Published: (2025)

Learning to Refuse: Towards Mitigating Privacy Risks in LLMs
by: Liu, Zhenhua, et al.
Published: (2024)

ReCOGS: How Incidental Details of a Logical Form Overshadow an Evaluation of Semantic Interpretation
by: Wu, Zhengxuan, et al.
Published: (2023)

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
by: Zeng, Yi, et al.
Published: (2024)

Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
by: von Recum, Alexander, et al.
Published: (2024)

Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks
by: Ham, Seokil, et al.
Published: (2025)

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
by: Wu, Zhengxuan, et al.
Published: (2025)

Do Prevalent Bias Metrics Capture Allocational Harms from LLMs?
by: Cyberey, Hannah, et al.
Published: (2024)

Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
by: García-Ferrero, Iker, et al.
Published: (2025)

Bridging Cognition and Emotion: Empathy-Driven Multimodal Misinformation Detection
by: Wang, Zihan, et al.
Published: (2025)

Do Reasoning LLMs Refuse What They Infer in Long Contexts?
by: Fu, Yu, et al.
Published: (2026)

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
by: Atil, Berk, et al.
Published: (2025)

Does Refusal Training in LLMs Generalize to the Past Tense?
by: Andriushchenko, Maksym, et al.
Published: (2024)

Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
by: Song, Maojia, et al.
Published: (2024)

Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
by: Choi, Sooyung, et al.
Published: (2025)

PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm
by: Li, Jing-Jing, et al.
Published: (2026)

Safety Alignment of Large Language Models via Contrasting Safe and Harmful Distributions
by: Zhang, Xiaoyun, et al.
Published: (2024)

Vector Arithmetic in Concept and Token Subspaces
by: Feucht, Sheridan, et al.
Published: (2025)

Don't Say No: Jailbreaking LLM by Suppressing Refusal
by: Zhou, Yukai, et al.
Published: (2024)

Locating and Editing Factual Associations in Mamba
by: Sharma, Arnab Sen, et al.
Published: (2024)

SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering
by: Maskey, Utsav, et al.
Published: (2025)

Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
by: Zhang, Chiyu, et al.
Published: (2025)