Library Catalog

Bibliographic Details
Main Author:	Halloran, John
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Cryptography and Security
Online Access:	https://arxiv.org/abs/2505.23634

Similar Items

MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits
by: Radosevich, Brandon, et al.
Published: (2025)

Leveraging RAG for Training-Free Alignment of LLMs
by: Halloran, John T.
Published: (2026)

Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
by: Halloran, John T., et al.
Published: (2026)

Understanding the Effects of Safety Unalignment on Large Language Models
by: Halloran, John T.
Published: (2026)

MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers
by: Wang, Zhiqiang, et al.
Published: (2025)

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
by: Hu, Xulin, et al.
Published: (2026)

Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
by: Kumar, Priyanshu, et al.
Published: (2024)

Secure Multi-Modal Data Fusion in Federated Digital Health Systems via MCP
by: Aueawatthanaphisut, Aueaphum
Published: (2025)

HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor
by: Wu, Zihui, et al.
Published: (2025)

MCP-ITP: An Automated Framework for Implicit Tool Poisoning in MCP
by: Li, Ruiqi, et al.
Published: (2026)

VIPER-MCP: Detecting and Exploiting Taint-Style Vulnerabilities in Model Context Protocol Servers
by: Sun, Pengyu, et al.
Published: (2026)

MCP-in-SoS: Risk assessment framework for open-source MCP servers
by: Kumar, Pratyay, et al.
Published: (2026)

Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
by: An, Bang, et al.
Published: (2024)

GESR: Graph-Based Edge Semantic Reconstruction for Stealthy Communication Detection with Benign-Only Training
by: Xu, Henghui, et al.
Published: (2026)

Improving LLM Safety Alignment with Dual-Objective Optimization
by: Zhao, Xuandong, et al.
Published: (2025)

MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks
by: Hao, Run, et al.
Published: (2026)

Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models
by: Chu, Fengheng, et al.
Published: (2026)

MCP Guardian: A Security-First Layer for Safeguarding MCP-Based AI System
by: Kumar, Sonu, et al.
Published: (2025)

MCP-SandboxScan: WASM-based Secure Execution and Runtime Analysis for MCP Tools
by: Tan, Zhuoran, et al.
Published: (2026)

FedBAP: Backdoor Defense via Benign Adversarial Perturbation in Federated Learning
by: Yan, Xinhai, et al.
Published: (2025)

The Surprising Harmfulness of Benign Overfitting for Adversarial Robustness
by: Hao, Yifan, et al.
Published: (2024)

Learning to Look Benign: Targeted Evasion of Malware Detectors via API Import Injection
by: Dautartas, Juozas, et al.
Published: (2026)

EASE: Practical and Efficient Safety Alignment for Small Language Models
by: Shi, Haonan, et al.
Published: (2025)

TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning
by: Xie, Zhixin, et al.
Published: (2026)

AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
by: Sheng, Leheng, et al.
Published: (2025)

Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift
by: Yuan, Shuai, et al.
Published: (2025)

Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment
by: Ding, Sihao
Published: (2026)

SDN-Based False Data Detection With Its Mitigation and Machine Learning Robustness for In-Vehicle Networks
by: Dang, Long, et al.
Published: (2025)

What is in Your Safe Data? Identifying Benign Data that Breaks Safety
by: He, Luxi, et al.
Published: (2024)

Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
by: Ghosal, Soumya Suvra, et al.
Published: (2024)

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
by: Fu, Shaopeng, et al.
Published: (2026)

AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning
by: Zhang, Yi, et al.
Published: (2025)

Explainable Machine Learning for Phishing Detection on Heterogeneous Datasets with MCP-Enabled Deployment
by: Dora, Nikhil Kumar, et al.
Published: (2026)

We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems
by: Li, Zhihao, et al.
Published: (2025)

What Does Normal Even Mean? Evaluating Benign Traffic in Intrusion Detection Datasets
by: Wilkinson, Meghan, et al.
Published: (2025)

Exploiting Defenses against GAN-Based Feature Inference Attacks in Federated Learning
by: Luo, Xinjian, et al.
Published: (2020)

Improved Generation of Adversarial Examples Against Safety-aligned LLMs
by: Li, Qizhang, et al.
Published: (2024)

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry
by: Lan, Wenhao, et al.
Published: (2026)

Furina: Fragmented Uncertainty-Driven Refusal Instability Attack
by: Wu, Tongxi, et al.
Published: (2026)

Securing the Model Context Protocol (MCP): Risks, Controls, and Governance
by: Errico, Herman, et al.
Published: (2025)