| Main Authors: | Zhu, Daniel; Wang, Zihan; Bao, Xuchan; Wei, Jerry |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.00267 |
Similar Items
Tell me about yourself: LLMs are aware of their learned behaviors
by: Betley, Jan, et al.
Published: (2025)
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
by: Betley, Jan, et al.
Published: (2025)
Early Signs of Steganographic Capabilities in Frontier LLMs
by: Zolkowski, Artur, et al.
Published: (2025)
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
by: Kumar, Priyanshu, et al.
Published: (2024)
LMEraser: Large Model Unlearning through Adaptive Prompt Tuning
by: Xu, Jie, et al.
Published: (2024)
Embedding Hidden Adversarial Capabilities in Pre-Trained Diffusion Models
by: Beerens, Lucas, et al.
Published: (2025)
Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation
by: Chen, Luoyu, et al.
Published: (2026)
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs
by: Panfilov, Alexander, et al.
Published: (2025)
CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
by: Wang, Zhun, et al.
Published: (2025)
Security Considerations for Artificial Intelligence Agents
by: Li, Ninghui, et al.
Published: (2026)
An AI Architecture with the Capability to Classify and Explain Hardware Trojans
by: Whitten, Paul, et al.
Published: (2024)
Self-Destructive Language Model
by: Wang, Yuhui, et al.
Published: (2025)
Can AI Models be Jailbroken to Phish Elderly Victims? An End-to-End Evaluation
by: Heiding, Fred, et al.
Published: (2025)
SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs
by: Pan, Chao, et al.
Published: (2026)
From Firewalls to Frontiers: AI Red-Teaming is a Domain-Specific Evolution of Cyber Red-Teaming
by: Sinha, Anusha, et al.
Published: (2025)
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
by: Anurin, Andrey, et al.
Published: (2024)
Attacks and Defenses for Generative Diffusion Models: A Comprehensive Survey
by: Truong, Vu Tuan, et al.
Published: (2024)
BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents
by: Zhang, Kaiyuan, et al.
Published: (2025)
IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs
by: Guo, Chuan, et al.
Published: (2026)
A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares
by: Cohen, Stav, et al.
Published: (2024)
The New Frontier of Cybersecurity: Emerging Threats and Innovations
by: Dave, Daksh, et al.
Published: (2023)
Password-Activated Shutdown Protocols for Misaligned Frontier Agents
by: Williams, Kai, et al.
Published: (2025)
Twin Auto-Encoder Model for Learning Separable Representation in Cyberattack Detection
by: Dinh, Phai Vu, et al.
Published: (2024)
MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors
by: Tong, Xin, et al.
Published: (2025)
Jailbreaking and Mitigation of Vulnerabilities in Large Language Models
by: Peng, Benji, et al.
Published: (2024)
Generative Models are Self-Watermarked: Declaring Model Authentication through Re-Generation
by: Desu, Aditya, et al.
Published: (2024)
Generating Adversarial Point Clouds Using Diffusion Model
by: Zhao, Ruiyang, et al.
Published: (2025)
Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy
by: Junhao, Wei, et al.
Published: (2025)
UIFV: Data Reconstruction Attack in Vertical Federated Learning
by: Yang, Jirui, et al.
Published: (2024)
Privacy and Accuracy Implications of Model Complexity and Integration in Heterogeneous Federated Learning
by: Németh, Gergely Dániel, et al.
Published: (2023)
PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization
by: Ertan, Murat Bilgehan, et al.
Published: (2026)
FIT-Print: Towards False-claim-resistant Model Ownership Verification via Targeted Fingerprint
by: Shao, Shuo, et al.
Published: (2025)
Enhancing Security in Deep Reinforcement Learning: A Comprehensive Survey on Adversarial Attacks and Defenses
by: Yichao, Wu, et al.
Published: (2025)
Optimal Transport-Guided Adversarial Attacks on Graph Neural Network-Based Bot Detection
by: Mukherjee, Kunal, et al.
Published: (2026)
Exploring Privacy and Fairness Risks in Sharing Diffusion Models: An Adversarial Perspective
by: Luo, Xinjian, et al.
Published: (2024)
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models
by: Fang, Junfeng, et al.
Published: (2025)
RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models
by: Liang, Jiacheng, et al.
Published: (2026)
TracLLM: A Generic Framework for Attributing Long Context LLMs
by: Wang, Yanting, et al.
Published: (2025)
Trusted Machine Learning Models Unlock Private Inference for Problems Currently Infeasible with Cryptography
by: Shumailov, Ilia, et al.
Published: (2025)
How Does a Deep Learning Model Architecture Impact Its Privacy? A Comprehensive Study of Privacy Attacks on CNNs and Transformers
by: Zhang, Guangsheng, et al.
Published: (2022)