| Main Authors: | Shen, Hongyuan, Zheng, Min, Wang, Jincheng, Zhao, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2502.20952 |
Similar Items
Understanding Data Importance in Machine Learning Attacks: Does Valuable Data Pose Greater Harm?
by: Wen, Rui, et al.
Published: (2024)
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
by: Shen, Xinyue, et al.
Published: (2023)
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
by: Chao, Patrick, et al.
Published: (2024)
JULI: Jailbreak Large Language Models by Self-Introspection
by: Wang, Jesson, et al.
Published: (2025)
Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models
by: Wang, Xiangwen, et al.
Published: (2026)
Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models
by: Li, Songze, et al.
Published: (2026)
Model-Editing-Based Jailbreak against Safety-aligned Large Language Models
by: Li, Yuxi, et al.
Published: (2024)
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
by: Jia, Xiaojun, et al.
Published: (2024)
Jailbreaking Large Language Models in Infinitely Many Ways
by: Goldstein, Oliver, et al.
Published: (2025)
DeepInception: Hypnotize Large Language Model to Be Jailbreaker
by: Li, Xuan, et al.
Published: (2023)
Voice Jailbreak Attacks Against GPT-4o
by: Shen, Xinyue, et al.
Published: (2024)
TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis
by: Wang, Longtian, et al.
Published: (2025)
FlashDP: Private Training Large Language Models with Efficient DP-SGD
by: Wang, Liangyu, et al.
Published: (2025)
Digger: Detecting Copyright Content Mis-usage in Large Language Model Training
by: Li, Haodong, et al.
Published: (2024)
Jailbreaking and Mitigation of Vulnerabilities in Large Language Models
by: Peng, Benji, et al.
Published: (2024)
Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design
by: Eshuijs, Leon, et al.
Published: (2026)
Mitigating Error Amplification in Fast Adversarial Training
by: Zhao, Mengnan, et al.
Published: (2026)
Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning
by: Liu, Guozhi, et al.
Published: (2025)
TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering
by: Thornton, Scott
Published: (2026)
Differentially Private Subspace Fine-Tuning for Large Language Models
by: Zheng, Lele, et al.
Published: (2026)
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
by: Reddy, Aashray, et al.
Published: (2025)
Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs
by: Chen, Wenyu, et al.
Published: (2026)
Label Privacy in Split Learning for Large Models with Parameter-Efficient Training
by: Zmushko, Philip, et al.
Published: (2024)
Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence
by: Fu, Shaopeng, et al.
Published: (2025)
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
by: Chu, Junjie, et al.
Published: (2024)
VERA: Variational Inference Framework for Jailbreaking Large Language Models
by: Lochab, Anamika, et al.
Published: (2025)
Fuzz-Testing Meets LLM-Based Agents: An Automated and Efficient Framework for Jailbreaking Text-To-Image Generation Models
by: Dong, Yingkai, et al.
Published: (2024)
Extracting Spatiotemporal Data from Gradients with Large Language Models
by: Zheng, Lele, et al.
Published: (2024)
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models
by: Liang, Siyuan, et al.
Published: (2025)
AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models
by: Chen, Guangke, et al.
Published: (2025)
EnJa: Ensemble Jailbreak on Large Language Models
by: Zhang, Jiahao, et al.
Published: (2024)
Improved Large Language Model Jailbreak Detection via Pretrained Embeddings
by: Galinkin, Erick, et al.
Published: (2024)
A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos
by: Yao, Yang, et al.
Published: (2025)
Sovereign Context Protocol: An Open Attribution Layer for Human-Generated Content in the Age of Large Language Models
by: Panchigar, Praneel, et al.
Published: (2026)
Understanding and Enhancing the Transferability of Jailbreaking Attacks
by: Lin, Runqi, et al.
Published: (2025)
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
by: Huang, Tiansheng, et al.
Published: (2024)
PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips
by: Coalson, Zachary, et al.
Published: (2024)
A Fast, Performant, Secure Distributed Training Framework For Large Language Model
by: Huang, Wei, et al.
Published: (2024)
The Surprising Harmfulness of Benign Overfitting for Adversarial Robustness
by: Hao, Yifan, et al.
Published: (2024)
GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection
by: Rad, Melissa Kazemi, et al.
Published: (2025)