| Main Authors: | Hsu, Yu-Ling, Su, Hsuan, Chen, Shang-Tse |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.01154 |
Similar Items
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
by: Reddy, Aashray, et al.
Published: (2025)
Fight Back Against Jailbreaking via Prompt Adversarial Tuning
by: Mo, Yichuan, et al.
Published: (2024)
PromptScreen: Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline
by: Rao, Akshaj Prashanth, et al.
Published: (2025)
STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
by: Li, Jing-Jing, et al.
Published: (2025)
SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
by: Saiem, Bijoy Ahmed, et al.
Published: (2024)
Universal Jailbreak Backdoors from Poisoned Human Feedback
by: Rando, Javier, et al.
Published: (2023)
KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
by: Liang, Buyun, et al.
Published: (2025)
Jailbreaking in the Haystack
by: Shah, Rishi Rajesh, et al.
Published: (2025)
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
by: Rando, Javier, et al.
Published: (2024)
A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy
by: Correia, Pedro H. Barcha, et al.
Published: (2026)
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
by: Chu, Junjie, et al.
Published: (2024)
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
by: Sharma, Mrinank, et al.
Published: (2025)
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
by: Hu, Xiaomeng, et al.
Published: (2024)
EnJa: Ensemble Jailbreak on Large Language Models
by: Zhang, Jiahao, et al.
Published: (2024)
Jailbreaking LLMs via Calibration
by: Lu, Yuxuan, et al.
Published: (2026)
Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval
by: Chen, Taiye, et al.
Published: (2025)
Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
by: Hua, Peichun, et al.
Published: (2025)
AdvPrefix: An Objective for Nuanced LLM Jailbreaks
by: Zhu, Sicheng, et al.
Published: (2024)
Rethinking How to Evaluate Language Model Jailbreak
by: Cai, Hongyu, et al.
Published: (2024)
Low-Resource Languages Jailbreak GPT-4
by: Yong, Zheng-Xin, et al.
Published: (2023)
Jailbreaking Large Language Models with Symbolic Mathematics
by: Bethany, Emet, et al.
Published: (2024)
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
by: Hu, Xulin, et al.
Published: (2026)
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
by: Mehrotra, Anay, et al.
Published: (2023)
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
by: Fang, Zhicheng, et al.
Published: (2026)
HSF: Defending against Jailbreak Attacks with Hidden State Filtering
by: Qian, Cheng, et al.
Published: (2024)
MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
by: Wahed, Muntasir, et al.
Published: (2025)
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
by: Wei, Zeming, et al.
Published: (2023)
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
by: Yi, Sibo, et al.
Published: (2024)
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
by: Rahman, Salman, et al.
Published: (2025)
MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation
by: Jiang, Weisen, et al.
Published: (2025)
A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos
by: Yao, Yang, et al.
Published: (2025)
An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks
by: Boreiko, Valentyn, et al.
Published: (2024)
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
by: Zheng, Xiaosen, et al.
Published: (2024)
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
by: Hasan, Adib, et al.
Published: (2024)
Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation
by: Chen, Sixu, et al.
Published: (2026)
PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization
by: Geng, Runpeng, et al.
Published: (2025)
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
by: Lin, Shi, et al.
Published: (2024)
Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses
by: Ahmed, Mohamed, et al.
Published: (2025)
Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
by: Li, Xiao, et al.
Published: (2024)
TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
by: Goel, Aman, et al.
Published: (2025)