:: Library Catalog

Bibliographic Details
Main Authors:	Yan, Bo; Lin, Weikai; Zhu, Yada; Wang, Song
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.16824

Similar Items

Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning
by: Yang, Xianglin, et al.
Published: (2025)

SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents
by: Liang, Siyuan, et al.
Published: (2025)

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
by: Xu, Zhangchen, et al.
Published: (2024)

AISA: Awakening Intrinsic Safety Awareness in Large Language Models against Jailbreak Attacks
by: Song, Weiming, et al.
Published: (2026)

JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
by: Nian, Yi, et al.
Published: (2025)

Safe2Harm: Semantic Isomorphism Attacks for Jailbreaking Large Language Models
by: Yang, Fan
Published: (2025)

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling
by: Wang, Ziwei, et al.
Published: (2026)

Re-Triggering Safeguards within LLMs for Jailbreak Detection
by: Lin, Zheng, et al.
Published: (2026)

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads
by: Wu, Jinman, et al.
Published: (2026)

SDD: Self-Degraded Defense against Malicious Fine-tuning
by: Chen, Zixuan, et al.
Published: (2025)

Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem
by: Lin, Shuyi, et al.
Published: (2025)

Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models
by: Hong, Wenjing, et al.
Published: (2026)

PCDiff: Proactive Control for Ownership Protection in Diffusion Models with Watermark Compatibility
by: Gai, Keke, et al.
Published: (2025)

Proactive Detection of Physical Inter-rule Vulnerabilities in IoT Services Using a Deep Learning Approach
by: Huang, Bing, et al.
Published: (2024)

Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation
by: Zhang, Wenhui, et al.
Published: (2025)

Proactively Detecting Threats: A Novel Approach Using LLMs
by: Chawla, Aniesh, et al.
Published: (2026)

SoK: Evaluating Jailbreak Guardrails for Large Language Models
by: Wang, Xunguang, et al.
Published: (2025)

SoK: Robustness in Large Language Models against Jailbreak Attacks
by: Xu, Feiyue, et al.
Published: (2026)

SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
by: Zhou, Kaiwen, et al.
Published: (2025)

Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs
by: Yan, Yu, et al.
Published: (2025)

SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models
by: Fang, Junfeng, et al.
Published: (2025)

Untargeted Jailbreak Attack
by: Huang, Xinzhe, et al.
Published: (2025)

Proactive Detection of Voice Cloning with Localized Watermarking
by: San Roman, Robin, et al.
Published: (2024)

NegBLEURT Forest: Leveraging Inconsistencies for Detecting Jailbreak Attacks
by: Sleem, Lama, et al.
Published: (2025)

DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
by: Wang, Yi, et al.
Published: (2025)

Breaking Minds, Breaking Systems: Jailbreaking Large Language Models via Human-like Psychological Manipulation
by: Liu, Zehao, et al.
Published: (2025)

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models
by: Peng, Benji, et al.
Published: (2024)

Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning
by: Wang, Zhaoqi, et al.
Published: (2025)

SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment
by: Fang, Xianya, et al.
Published: (2026)

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
by: Lin, Zheng, et al.
Published: (2026)

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
by: Wang, Hao, et al.
Published: (2026)

Emoji-Based Jailbreaking of Large Language Models
by: Gopinadh, M P V S, et al.
Published: (2026)

Defending against Jailbreak through Early Exit Generation of Large Language Models
by: Zhao, Chongwen, et al.
Published: (2024)

PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs
by: Gong, Xueluan, et al.
Published: (2024)

The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
by: Wu, Zihui, et al.
Published: (2024)

BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
by: Zeng, Yi, et al.
Published: (2024)

Coward: Collision-based OOD Watermarking for Practical Proactive Federated Backdoor Detection
by: Li, Wenjie, et al.
Published: (2025)

ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack
by: Lin, Xingwei, et al.
Published: (2026)

LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
by: Li, Haoyang, et al.
Published: (2025)

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
by: Andriushchenko, Maksym, et al.
Published: (2024)