:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Hsu, Yu-Ling, Su, Hsuan, Chen, Shang-Tse
Format:	Preprint
Published:	2025
Subjects:	Computation and Language Artificial Intelligence Cryptography and Security Machine Learning
Online Access:	https://arxiv.org/abs/2502.01154
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
by: Reddy, Aashray, et al.
Published: (2025)

Fight Back Against Jailbreaking via Prompt Adversarial Tuning
by: Mo, Yichuan, et al.
Published: (2024)

PromptScreen: Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline
by: Rao, Akshaj Prashanth, et al.
Published: (2025)

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
by: Li, Jing-Jing, et al.
Published: (2025)

SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
by: Saiem, Bijoy Ahmed, et al.
Published: (2024)

Universal Jailbreak Backdoors from Poisoned Human Feedback
by: Rando, Javier, et al.
Published: (2023)

KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
by: Liang, Buyun, et al.
Published: (2025)

Jailbreaking in the Haystack
by: Shah, Rishi Rajesh, et al.
Published: (2025)

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
by: Rando, Javier, et al.
Published: (2024)

A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy
by: Correia, Pedro H. Barcha, et al.
Published: (2026)

JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
by: Chu, Junjie, et al.
Published: (2024)

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
by: Sharma, Mrinank, et al.
Published: (2025)

Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
by: Hu, Xiaomeng, et al.
Published: (2024)

EnJa: Ensemble Jailbreak on Large Language Models
by: Zhang, Jiahao, et al.
Published: (2024)

Jailbreaking LLMs via Calibration
by: Lu, Yuxuan, et al.
Published: (2026)

Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval
by: Chen, Taiye, et al.
Published: (2025)

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
by: Hua, Peichun, et al.
Published: (2025)

AdvPrefix: An Objective for Nuanced LLM Jailbreaks
by: Zhu, Sicheng, et al.
Published: (2024)

Rethinking How to Evaluate Language Model Jailbreak
by: Cai, Hongyu, et al.
Published: (2024)

Low-Resource Languages Jailbreak GPT-4
by: Yong, Zheng-Xin, et al.
Published: (2023)

Jailbreaking Large Language Models with Symbolic Mathematics
by: Bethany, Emet, et al.
Published: (2024)

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
by: Hu, Xulin, et al.
Published: (2026)

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
by: Mehrotra, Anay, et al.
Published: (2023)

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
by: Fang, Zhicheng, et al.
Published: (2026)

HSF: Defending against Jailbreak Attacks with Hidden State Filtering
by: Qian, Cheng, et al.
Published: (2024)

MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
by: Wahed, Muntasir, et al.
Published: (2025)

Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
by: Wei, Zeming, et al.
Published: (2023)

Jailbreak Attacks and Defenses Against Large Language Models: A Survey
by: Yi, Sibo, et al.
Published: (2024)

X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
by: Rahman, Salman, et al.
Published: (2025)

MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation
by: Jiang, Weisen, et al.
Published: (2025)

A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos
by: Yao, Yang, et al.
Published: (2025)

An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks
by: Boreiko, Valentyn, et al.
Published: (2024)

Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
by: Zheng, Xiaosen, et al.
Published: (2024)

Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
by: Hasan, Adib, et al.
Published: (2024)

Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation
by: Chen, Sixu, et al.
Published: (2026)

PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization
by: Geng, Runpeng, et al.
Published: (2025)

LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
by: Lin, Shi, et al.
Published: (2024)

Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses
by: Ahmed, Mohamed, et al.
Published: (2025)

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
by: Li, Xiao, et al.
Published: (2024)

TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
by: Goel, Aman, et al.
Published: (2025)