| Main Author: | Yang, Fan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2603.10091 |
Similar Items
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
by: Guo, Weiyang, et al.
Published: (2025)
Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
by: Ferrand, Jean-Charles Noirot, et al.
Published: (2025)
Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks
by: Saha, Shoumik, et al.
Published: (2025)
On the Feasibility of Using MultiModal LLMs to Execute AR Social Engineering Attacks
by: Bi, Ting, et al.
Published: (2025)
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack
by: Li, Hao, et al.
Published: (2026)
Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment
by: Kim, Jaehan, et al.
Published: (2025)
BESA: Boosting Encoder Stealing Attack with Perturbation Recovery
by: Ren, Xuhao, et al.
Published: (2025)
Reimagining Safety Alignment with An Image
by: Xia, Yifan, et al.
Published: (2025)
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
by: Ghosal, Soumya Suvra, et al.
Published: (2024)
Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization
by: Li, Xurui, et al.
Published: (2025)
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
by: Andriushchenko, Maksym, et al.
Published: (2024)
Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs
by: Zhao, Shuai, et al.
Published: (2024)
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
by: Chae, Kyubyung, et al.
Published: (2025)
Safe2Harm: Semantic Isomorphism Attacks for Jailbreaking Large Language Models
by: Yang, Fan
Published: (2025)
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In
by: Nakash, Itay, et al.
Published: (2024)
Agent Safety Alignment via Reinforcement Learning
by: Sha, Zeyang, et al.
Published: (2025)
A Method for Enhancing the Safety of Large Model Generation Based on Multi-dimensional Attack and Defense
by: Zhai, Keke
Published: (2024)
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
by: Cao, Bochuan, et al.
Published: (2023)
Comprehensive Botnet Detection by Mitigating Adversarial Attacks, Navigating the Subtleties of Perturbation Distances and Fortifying Predictions with Conformal Layers
by: Yumlembam, Rahul, et al.
Published: (2024)
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
by: Xu, Zhao, et al.
Published: (2024)
Quantifying the Noise of Structural Perturbations on Graph Adversarial Attacks
by: Fang, Junyuan, et al.
Published: (2025)
BitAbuse: A Dataset of Visually Perturbed Texts for Defending Phishing Attacks
by: Lee, Hanyong, et al.
Published: (2025)
Scam Shield: Multi-Model Voting and Fine-Tuned LLMs Against Adversarial Attacks
by: Chang, Chen-Wei, et al.
Published: (2025)
Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks
by: Tong, Haibo, et al.
Published: (2025)
Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction
by: Wang, Hongtao, et al.
Published: (2026)
AttackSeqBench: Benchmarking the Capabilities of LLMs for Attack Sequences Understanding
by: Ma, Haokai, et al.
Published: (2025)
Measuring Safety Alignment Effects in Autonomous Security Agents
by: David, Isaac, et al.
Published: (2026)
Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
by: Liu, Fan, et al.
Published: (2024)
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
by: Li, Siyuan, et al.
Published: (2026)
A Model Stealing Attack Against Multi-Exit Networks
by: Pan, Li, et al.
Published: (2023)
Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?
by: Yin, Qingyu, et al.
Published: (2025)
BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models
by: Liu, Shuaitong, et al.
Published: (2025)
FlipAttack: Jailbreak LLMs via Flipping
by: Liu, Yue, et al.
Published: (2024)
FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment
by: Kuznetsov, Daniel, et al.
Published: (2026)
VisuoAlign: Safety Alignment of LVLMs with Multimodal Tree Search
by: Li, MingSheng, et al.
Published: (2025)
Attention Masks Help Adversarial Attacks to Bypass Safety Detectors
by: Shi, Yunfan
Published: (2024)
ShadowCode: Towards (Automatic) External Prompt Injection Attack against Code LLMs
by: Yang, Yuchen, et al.
Published: (2024)
Your Semantic-Independent Watermark is Fragile: A Semantic Perturbation Attack against EaaS Watermark
by: Fei, Zekun, et al.
Published: (2024)
Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity
by: An, Hongjun, et al.
Published: (2026)
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment
by: Wang, Haoran, et al.
Published: (2023)