| Main Authors: | Wu, Jiaqi, Chen, Chen, Hou, Chunyan, Yuan, Xiaojie |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.15594 |
Similar Items
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
by: Xu, Zhangchen, et al.
Published: (2024)
Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing
by: Zhao, Yinzhi, et al.
Published: (2026)
TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models
by: Xu, Zhi, et al.
Published: (2026)
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
by: Ji, Jiabao, et al.
Published: (2024)
JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
by: Feng, Yingchaojie, et al.
Published: (2024)
Distract Large Language Models for Automatic Jailbreak Attack
by: Xiao, Zeguan, et al.
Published: (2024)
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
by: Huang, Caishuang, et al.
Published: (2024)
Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models
by: Tu, Shangqing, et al.
Published: (2024)
Jailbreaking Large Language Models with Morality Attacks
by: Su, Ying, et al.
Published: (2026)
Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
by: Kadali, Sri Durga Sai Sowmya, et al.
Published: (2026)
Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models
by: Miao, Ziqi, et al.
Published: (2025)
Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
by: Zhu, Junda, et al.
Published: (2025)
SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks
by: Cao, Hongye, et al.
Published: (2025)
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models
by: Gao, Lang, et al.
Published: (2024)
Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks
by: Zhao, Jiawei, et al.
Published: (2024)
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
by: Shen, Guobin, et al.
Published: (2024)
Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation
by: Feng, Bo-Han, et al.
Published: (2026)
AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
by: Shu, Dong, et al.
Published: (2024)
Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models
by: Cui, Chenhang, et al.
Published: (2024)
Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models
by: Sun, Xiaobing, et al.
Published: (2026)
CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models
by: Zhou, Guanghao, et al.
Published: (2025)
Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring
by: Mu, Honglin, et al.
Published: (2024)
UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models
by: Oh, Sejoon, et al.
Published: (2024)
Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
by: Lu, Weikai, et al.
Published: (2024)
Jailbreaking Attack against Multimodal Large Language Model
by: Niu, Zhenxing, et al.
Published: (2024)
Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective
by: Li, Tianlong, et al.
Published: (2024)
Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models
by: Xu, Zihao, et al.
Published: (2024)
ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models
by: Cheng, Siyang, et al.
Published: (2025)
Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation
by: Zhang, Yuanhe, et al.
Published: (2025)
Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models
by: Jiang, Lei, et al.
Published: (2025)
Safety Alignment of Large Language Models via Contrasting Safe and Harmful Distributions
by: Zhang, Xiaoyun, et al.
Published: (2024)
Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?
by: Xin, Yuan, et al.
Published: (2025)
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
by: Zhang, Zhexin, et al.
Published: (2023)
Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models
by: Pernisi, Fabio, et al.
Published: (2024)
ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs
by: Ni, Ziyi, et al.
Published: (2025)
Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models
by: Dong, Yiting, et al.
Published: (2024)
LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
by: Yu, Miao, et al.
Published: (2024)
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
by: Zhang, Junbo, et al.
Published: (2025)
ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models
by: Zhang, Hengxiang, et al.
Published: (2024)
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
by: Zhao, Shiji, et al.
Published: (2025)