| Main Authors: | Yan, Bo, Lin, Weikai, Zhu, Yada, Wang, Song |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.16824 |
Similar Items
Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning
by: Yang, Xianglin, et al.
Published: (2025)
SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents
by: Liang, Siyuan, et al.
Published: (2025)
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
by: Xu, Zhangchen, et al.
Published: (2024)
AISA: Awakening Intrinsic Safety Awareness in Large Language Models against Jailbreak Attacks
by: Song, Weiming, et al.
Published: (2026)
JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
by: Nian, Yi, et al.
Published: (2025)
Safe2Harm: Semantic Isomorphism Attacks for Jailbreaking Large Language Models
by: Yang, Fan
Published: (2025)
Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling
by: Wang, Ziwei, et al.
Published: (2026)
Re-Triggering Safeguards within LLMs for Jailbreak Detection
by: Lin, Zheng, et al.
Published: (2026)
Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads
by: Wu, Jinman, et al.
Published: (2026)
SDD: Self-Degraded Defense against Malicious Fine-tuning
by: Chen, Zixuan, et al.
Published: (2025)
Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem
by: Lin, Shuyi, et al.
Published: (2025)
Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models
by: Hong, Wenjing, et al.
Published: (2026)
PCDiff: Proactive Control for Ownership Protection in Diffusion Models with Watermark Compatibility
by: Gai, Keke, et al.
Published: (2025)
Proactive Detection of Physical Inter-rule Vulnerabilities in IoT Services Using a Deep Learning Approach
by: Huang, Bing, et al.
Published: (2024)
Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation
by: Zhang, Wenhui, et al.
Published: (2025)
Proactively Detecting Threats: A Novel Approach Using LLMs
by: Chawla, Aniesh, et al.
Published: (2026)
SoK: Evaluating Jailbreak Guardrails for Large Language Models
by: Wang, Xunguang, et al.
Published: (2025)
SoK: Robustness in Large Language Models against Jailbreak Attacks
by: Xu, Feiyue, et al.
Published: (2026)
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
by: Zhou, Kaiwen, et al.
Published: (2025)
Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs
by: Yan, Yu, et al.
Published: (2025)
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models
by: Fang, Junfeng, et al.
Published: (2025)
Untargeted Jailbreak Attack
by: Huang, Xinzhe, et al.
Published: (2025)
Proactive Detection of Voice Cloning with Localized Watermarking
by: Roman, Robin San, et al.
Published: (2024)
NegBLEURT Forest: Leveraging Inconsistencies for Detecting Jailbreak Attacks
by: Sleem, Lama, et al.
Published: (2025)
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
by: Wang, Yi, et al.
Published: (2025)
Breaking Minds, Breaking Systems: Jailbreaking Large Language Models via Human-like Psychological Manipulation
by: Liu, Zehao, et al.
Published: (2025)
Jailbreaking and Mitigation of Vulnerabilities in Large Language Models
by: Peng, Benji, et al.
Published: (2024)
Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning
by: Wang, Zhaoqi, et al.
Published: (2025)
SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment
by: Fang, Xianya, et al.
Published: (2026)
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
by: Lin, Zheng, et al.
Published: (2026)
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
by: Wang, Hao, et al.
Published: (2026)
Emoji-Based Jailbreaking of Large Language Models
by: Gopinadh, M P V S, et al.
Published: (2026)
Defending against Jailbreak through Early Exit Generation of Large Language Models
by: Zhao, Chongwen, et al.
Published: (2024)
PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs
by: Gong, Xueluan, et al.
Published: (2024)
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
by: Wu, Zihui, et al.
Published: (2024)
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
by: Zeng, Yi, et al.
Published: (2024)
Coward: Collision-based OOD Watermarking for Practical Proactive Federated Backdoor Detection
by: Li, Wenjie, et al.
Published: (2025)
ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack
by: Lin, Xingwei, et al.
Published: (2026)
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
by: Li, Haoyang, et al.
Published: (2025)
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
by: Andriushchenko, Maksym, et al.
Published: (2024)