| Main Authors: | Liang, Zhibo, Hu, Tianze, Chen, Zaiye, Tang, Mingjie |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.06716 |
Similar Items
AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents
by: Zhang, Yixiang, et al.
Published: (2026)
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models
by: Zhang, Jinchuan, et al.
Published: (2025)
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
by: Xu, Huiyu, et al.
Published: (2024)
From Threat Intelligence to Firewall Rules: Semantic Relations in Hybrid AI Agent and Expert System Architectures
by: Bonfanti, Chiara, et al.
Published: (2026)
ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models
by: Cheng, Siyang, et al.
Published: (2025)
AgentSOC: A Multi-Layer Agentic AI Framework for Security Operations Automation
by: Roy, Joyjit, et al.
Published: (2026)
Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
by: Liu, Xiao, et al.
Published: (2024)
Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection
by: Chen, Jiaqi, et al.
Published: (2024)
Proof-of-Guardrail in AI Agents and What (Not) to Trust from It
by: Jin, Xisen, et al.
Published: (2026)
CATMark: A Context-Aware Thresholding Framework for Robust Cross-Task Watermarking in Large Language Models
by: Zhang, Yu, et al.
Published: (2025)
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
by: Cao, Bochuan, et al.
Published: (2023)
Waterfall: Framework for Robust and Scalable Text Watermarking and Provenance for LLMs
by: Lau, Gregory Kang Ruey, et al.
Published: (2024)
Towards Understanding the Cognitive Habits of Large Reasoning Models
by: Dong, Jianshuo, et al.
Published: (2025)
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
by: Liang, Zi, et al.
Published: (2026)
Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs
by: Liu, Jinbo, et al.
Published: (2025)
CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models
by: Zhou, Guanghao, et al.
Published: (2025)
SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment
by: Lin, Xixun, et al.
Published: (2026)
AVISE: Framework for Evaluating the Security of AI Systems
by: Lempinen, Mikko, et al.
Published: (2026)
Universal and Context-Independent Triggers for Precise Control of LLM Outputs
by: Liang, Jiashuo, et al.
Published: (2024)
Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation
by: Qiao, Yuxuan, et al.
Published: (2025)
Decoupled Alignment for Robust Plug-and-Play Adaptation
by: Luo, Haozheng, et al.
Published: (2024)
DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection
by: Yan, Yuliang, et al.
Published: (2025)
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack
by: Li, Hao, et al.
Published: (2026)
Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?
by: Karkevandi, Mohammad Bahrami, et al.
Published: (2024)
MAGE: Safeguarding LLM Agents against Long-Horizon Threats via Shadow Memory
by: Wang, Yuhui, et al.
Published: (2026)
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
by: Yang, Wenkai, et al.
Published: (2024)
Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems
by: Wang, Xiaoqing, et al.
Published: (2025)
MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue
by: Wang, Fengxiang, et al.
Published: (2024)
Textual Unlearning Gives a False Sense of Unlearning
by: Du, Jiacheng, et al.
Published: (2024)
RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction
by: Jiang, Tanqiu, et al.
Published: (2024)
PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System
by: Munoz, Gary D. Lopez, et al.
Published: (2024)
NSmark: Null Space Based Black-box Watermarking Defense Framework for Language Models
by: Zhao, Haodong, et al.
Published: (2024)
Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models
by: Yu, Yongcan, et al.
Published: (2025)
SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness
by: Huo, Jiahao, et al.
Published: (2026)
QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language
by: Zou, Qingsong, et al.
Published: (2025)
LATTICE: Evaluating Decision Support Utility of Crypto Agents
by: Chan, Aaron, et al.
Published: (2026)
Does Low Rank Adaptation Lead to Lower Robustness against Training-Time Attacks?
by: Liang, Zi, et al.
Published: (2025)
Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models
by: Yuan, Hongbang, et al.
Published: (2024)
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
by: Leong, Chak Tou, et al.
Published: (2025)
LLMs Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
by: Hu, Xuhao, et al.
Published: (2025)