:: Library Catalog

Bibliographic Details
Main Authors:	Zhu, Daniel; Wang, Zihan; Bao, Xuchan; Wei, Jerry
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence; Cryptography and Security
Online Access:	https://arxiv.org/abs/2605.00267

Similar Items

Tell me about yourself: LLMs are aware of their learned behaviors
by: Betley, Jan, et al.
Published: (2025)

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
by: Betley, Jan, et al.
Published: (2025)

Early Signs of Steganographic Capabilities in Frontier LLMs
by: Zolkowski, Artur, et al.
Published: (2025)

Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
by: Kumar, Priyanshu, et al.
Published: (2024)

LMEraser: Large Model Unlearning through Adaptive Prompt Tuning
by: Xu, Jie, et al.
Published: (2024)

Embedding Hidden Adversarial Capabilities in Pre-Trained Diffusion Models
by: Beerens, Lucas, et al.
Published: (2025)

Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation
by: Chen, Luoyu, et al.
Published: (2026)

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs
by: Panfilov, Alexander, et al.
Published: (2025)

CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
by: Wang, Zhun, et al.
Published: (2025)

Security Considerations for Artificial Intelligence Agents
by: Li, Ninghui, et al.
Published: (2026)

An AI Architecture with the Capability to Classify and Explain Hardware Trojans
by: Whitten, Paul, et al.
Published: (2024)

Self-Destructive Language Model
by: Wang, Yuhui, et al.
Published: (2025)

Can AI Models be Jailbroken to Phish Elderly Victims? An End-to-End Evaluation
by: Heiding, Fred, et al.
Published: (2025)

SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs
by: Pan, Chao, et al.
Published: (2026)

From Firewalls to Frontiers: AI Red-Teaming is a Domain-Specific Evolution of Cyber Red-Teaming
by: Sinha, Anusha, et al.
Published: (2025)

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
by: Anurin, Andrey, et al.
Published: (2024)

Attacks and Defenses for Generative Diffusion Models: A Comprehensive Survey
by: Truong, Vu Tuan, et al.
Published: (2024)

BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents
by: Zhang, Kaiyuan, et al.
Published: (2025)

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs
by: Guo, Chuan, et al.
Published: (2026)

A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares
by: Cohen, Stav, et al.
Published: (2024)

The New Frontier of Cybersecurity: Emerging Threats and Innovations
by: Dave, Daksh, et al.
Published: (2023)

Password-Activated Shutdown Protocols for Misaligned Frontier Agents
by: Williams, Kai, et al.
Published: (2025)

Twin Auto-Encoder Model for Learning Separable Representation in Cyberattack Detection
by: Dinh, Phai Vu, et al.
Published: (2024)

MEUV: Achieving Fine-Grained Capability Activation in Large Language Models via Mutually Exclusive Unlock Vectors
by: Tong, Xin, et al.
Published: (2025)

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models
by: Peng, Benji, et al.
Published: (2024)

Generative Models are Self-Watermarked: Declaring Model Authentication through Re-Generation
by: Desu, Aditya, et al.
Published: (2024)

Generating Adversarial Point Clouds Using Diffusion Model
by: Zhao, Ruiyang, et al.
Published: (2025)

Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy
by: Junhao, Wei, et al.
Published: (2025)

UIFV: Data Reconstruction Attack in Vertical Federated Learning
by: Yang, Jirui, et al.
Published: (2024)

Privacy and Accuracy Implications of Model Complexity and Integration in Heterogeneous Federated Learning
by: Németh, Gergely Dániel, et al.
Published: (2023)

PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization
by: Ertan, Murat Bilgehan, et al.
Published: (2026)

FIT-Print: Towards False-claim-resistant Model Ownership Verification via Targeted Fingerprint
by: Shao, Shuo, et al.
Published: (2025)

Enhancing Security in Deep Reinforcement Learning: A Comprehensive Survey on Adversarial Attacks and Defenses
by: Yichao, Wu, et al.
Published: (2025)

Optimal Transport-Guided Adversarial Attacks on Graph Neural Network-Based Bot Detection
by: Mukherjee, Kunal, et al.
Published: (2026)

Exploring Privacy and Fairness Risks in Sharing Diffusion Models: An Adversarial Perspective
by: Luo, Xinjian, et al.
Published: (2024)

SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models
by: Fang, Junfeng, et al.
Published: (2025)

RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models
by: Liang, Jiacheng, et al.
Published: (2026)

TracLLM: A Generic Framework for Attributing Long Context LLMs
by: Wang, Yanting, et al.
Published: (2025)

Trusted Machine Learning Models Unlock Private Inference for Problems Currently Infeasible with Cryptography
by: Shumailov, Ilia, et al.
Published: (2025)

How Does a Deep Learning Model Architecture Impact Its Privacy? A Comprehensive Study of Privacy Attacks on CNNs and Transformers
by: Zhang, Guangsheng, et al.
Published: (2022)