:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Hao; Li, Hao; Huang, Minlie; Sha, Lei
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2402.16006

Similar Items

BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models
by: Wang, Xinyuan, et al.
Published: (2024)

DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak
by: Wang, Hao, et al.
Published: (2024)

Mitigating Adversarial Attacks in LLMs through Defensive Suffix Generation
by: Kim, Minkyoung, et al.
Published: (2024)

Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
by: Zhang, Zhexin, et al.
Published: (2023)

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?
by: Mu, Junjie, et al.
Published: (2025)

ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs
by: Ni, Ziyi, et al.
Published: (2025)

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
by: Xu, Zhao, et al.
Published: (2024)

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
by: Chen, Yunhao, et al.
Published: (2025)

Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
by: Liu, Fan, et al.
Published: (2024)

LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs
by: Li, Ran, et al.
Published: (2025)

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors
by: Zhang, Zhexin, et al.
Published: (2024)

Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation
by: Meng, Wenlong, et al.
Published: (2025)

From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks
by: Zhang, Zhexin, et al.
Published: (2024)

HSF: Defending against Jailbreak Attacks with Hidden State Filtering
by: Qian, Cheng, et al.
Published: (2024)

Harnessing the Plug-and-Play Controller by Prompting
by: Wang, Hao, et al.
Published: (2024)

Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers
by: Lin, Liang, et al.
Published: (2025)

AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
by: Liao, Zeyi, et al.
Published: (2024)

AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts
by: Kumar, Vishal, et al.
Published: (2024)

Activation-Guided Local Editing for Jailbreaking Attacks
by: Wang, Jiecong, et al.
Published: (2025)

AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation
by: Wang, Zijun, et al.
Published: (2024)

Defending LLMs against Jailbreaking Attacks via Backtranslation
by: Wang, Yihan, et al.
Published: (2024)

Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks
by: Noughabi, Havva Alizadeh, et al.
Published: (2025)

AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
by: Zhang, Zhexin, et al.
Published: (2025)

SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis
by: Wong, Aidan, et al.
Published: (2024)

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
by: Chen, Kexin, et al.
Published: (2024)

A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
by: Luo, Xuan, et al.
Published: (2025)

Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding
by: Xiao, Zhongyu, et al.
Published: (2026)

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
by: Wang, Hao, et al.
Published: (2026)

JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
by: Chu, Junjie, et al.
Published: (2024)

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
by: Lin, Yuping, et al.
Published: (2024)

Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs
by: Zhou, Yao, et al.
Published: (2026)

DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification
by: Li, Yu, et al.
Published: (2025)

Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
by: Li, Qizhang, et al.
Published: (2024)

COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
by: Guo, Xingang, et al.
Published: (2024)

Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models
by: Tu, Shangqing, et al.
Published: (2024)

Medical MLLM is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
by: Huang, Xijie, et al.
Published: (2024)

Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs
by: Li, Xiang, et al.
Published: (2025)

MiniPLM: Knowledge Distillation for Pre-Training Language Models
by: Gu, Yuxian, et al.
Published: (2024)

Language Models Hallucinate, but May Excel at Fact Verification
by: Guan, Jian, et al.
Published: (2023)

Large Language Models Are Not Robust Multiple Choice Selectors
by: Zheng, Chujie, et al.
Published: (2023)