:: Library Catalog

Bibliographic Details
Main Author:	Yang, Fan
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.10091

Similar Items

MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
by: Guo, Weiyang, et al.
Published: (2025)

Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
by: Ferrand, Jean-Charles Noirot, et al.
Published: (2025)

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks
by: Saha, Shoumik, et al.
Published: (2025)

On the Feasibility of Using MultiModal LLMs to Execute AR Social Engineering Attacks
by: Bi, Ting, et al.
Published: (2025)

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack
by: Li, Hao, et al.
Published: (2026)

Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment
by: Kim, Jaehan, et al.
Published: (2025)

BESA: Boosting Encoder Stealing Attack with Perturbation Recovery
by: Ren, Xuhao, et al.
Published: (2025)

Reimagining Safety Alignment with An Image
by: Xia, Yifan, et al.
Published: (2025)

Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
by: Ghosal, Soumya Suvra, et al.
Published: (2024)

Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization
by: Li, Xurui, et al.
Published: (2025)

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
by: Andriushchenko, Maksym, et al.
Published: (2024)

Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs
by: Zhao, Shuai, et al.
Published: (2024)

From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
by: Chae, Kyubyung, et al.
Published: (2025)

Safe2Harm: Semantic Isomorphism Attacks for Jailbreaking Large Language Models
by: Yang, Fan
Published: (2025)

Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In
by: Nakash, Itay, et al.
Published: (2024)

Agent Safety Alignment via Reinforcement Learning
by: Sha, Zeyang, et al.
Published: (2025)

A Method for Enhancing the Safety of Large Model Generation Based on Multi-dimensional Attack and Defense
by: Zhai, Keke
Published: (2024)

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
by: Cao, Bochuan, et al.
Published: (2023)

Comprehensive Botnet Detection by Mitigating Adversarial Attacks, Navigating the Subtleties of Perturbation Distances and Fortifying Predictions with Conformal Layers
by: Yumlembam, Rahul, et al.
Published: (2024)

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
by: Xu, Zhao, et al.
Published: (2024)

Quantifying the Noise of Structural Perturbations on Graph Adversarial Attacks
by: Fang, Junyuan, et al.
Published: (2025)

BitAbuse: A Dataset of Visually Perturbed Texts for Defending Phishing Attacks
by: Lee, Hanyong, et al.
Published: (2025)

Scam Shield: Multi-Model Voting and Fine-Tuned LLMs Against Adversarial Attacks
by: Chang, Chen-Wei, et al.
Published: (2025)

Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks
by: Tong, Haibo, et al.
Published: (2025)

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction
by: Wang, Hongtao, et al.
Published: (2026)

AttackSeqBench: Benchmarking the Capabilities of LLMs for Attack Sequences Understanding
by: Ma, Haokai, et al.
Published: (2025)

Measuring Safety Alignment Effects in Autonomous Security Agents
by: David, Isaac, et al.
Published: (2026)

Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
by: Liu, Fan, et al.
Published: (2024)

CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
by: Li, Siyuan, et al.
Published: (2026)

A Model Stealing Attack Against Multi-Exit Networks
by: Pan, Li, et al.
Published: (2023)

Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?
by: Yin, Qingyu, et al.
Published: (2025)

BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models
by: Liu, Shuaitong, et al.
Published: (2025)

FlipAttack: Jailbreak LLMs via Flipping
by: Liu, Yue, et al.
Published: (2024)

FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment
by: Kuznetsov, Daniel, et al.
Published: (2026)

VisuoAlign: Safety Alignment of LVLMs with Multimodal Tree Search
by: Li, MingSheng, et al.
Published: (2025)

Attention Masks Help Adversarial Attacks to Bypass Safety Detectors
by: Shi, Yunfan
Published: (2024)

ShadowCode: Towards (Automatic) External Prompt Injection Attack against Code LLMs
by: Yang, Yuchen, et al.
Published: (2024)

Your Semantic-Independent Watermark is Fragile: A Semantic Perturbation Attack against EaaS Watermark
by: Fei, Zekun, et al.
Published: (2024)

Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity
by: An, Hongjun, et al.
Published: (2026)

Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment
by: Wang, Haoran, et al.
Published: (2023)