| Main Author: | Halloran, John |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.23634 |
Similar Items
MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits
by: Radosevich, Brandon, et al.
Published: (2025)
Leveraging RAG for Training-Free Alignment of LLMs
by: Halloran, John T.
Published: (2026)
Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
by: Halloran, John T., et al.
Published: (2026)
Understanding the Effects of Safety Unalignment on Large Language Models
by: Halloran, John T.
Published: (2026)
MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers
by: Wang, Zhiqiang, et al.
Published: (2025)
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
by: Hu, Xulin, et al.
Published: (2026)
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
by: Kumar, Priyanshu, et al.
Published: (2024)
Secure Multi-Modal Data Fusion in Federated Digital Health Systems via MCP
by: Aueawatthanaphisut, Aueaphum
Published: (2025)
HumorReject: Decoupling LLM Safety from Refusal Prefix via A Little Humor
by: Wu, Zihui, et al.
Published: (2025)
MCP-ITP: An Automated Framework for Implicit Tool Poisoning in MCP
by: Li, Ruiqi, et al.
Published: (2026)
VIPER-MCP: Detecting and Exploiting Taint-Style Vulnerabilities in Model Context Protocol Servers
by: Sun, Pengyu, et al.
Published: (2026)
MCP-in-SoS: Risk assessment framework for open-source MCP servers
by: Kumar, Pratyay, et al.
Published: (2026)
Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
by: An, Bang, et al.
Published: (2024)
GESR: Graph-Based Edge Semantic Reconstruction for Stealthy Communication Detection with Benign-Only Training
by: Xu, Henghui, et al.
Published: (2026)
Improving LLM Safety Alignment with Dual-Objective Optimization
by: Zhao, Xuandong, et al.
Published: (2025)
MCP Pitfall Lab: Exposing Developer Pitfalls in MCP Tool Server Security under Multi-Vector Attacks
by: Hao, Run, et al.
Published: (2026)
Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models
by: Chu, Fengheng, et al.
Published: (2026)
MCP Guardian: A Security-First Layer for Safeguarding MCP-Based AI System
by: Kumar, Sonu, et al.
Published: (2025)
MCP-SandboxScan: WASM-based Secure Execution and Runtime Analysis for MCP Tools
by: Tan, Zhuoran, et al.
Published: (2026)
FedBAP: Backdoor Defense via Benign Adversarial Perturbation in Federated Learning
by: Yan, Xinhai, et al.
Published: (2025)
The Surprising Harmfulness of Benign Overfitting for Adversarial Robustness
by: Hao, Yifan, et al.
Published: (2024)
Learning to Look Benign: Targeted Evasion of Malware Detectors via API Import Injection
by: Dautartas, Juozas, et al.
Published: (2026)
EASE: Practical and Efficient Safety Alignment for Small Language Models
by: Shi, Haonan, et al.
Published: (2025)
TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning
by: Xie, Zhixin, et al.
Published: (2026)
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
by: Sheng, Leheng, et al.
Published: (2025)
Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift
by: Yuan, Shuai, et al.
Published: (2025)
Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment
by: Ding, Sihao
Published: (2026)
SDN-Based False Data Detection With Its Mitigation and Machine Learning Robustness for In-Vehicle Networks
by: Dang, Long, et al.
Published: (2025)
What is in Your Safe Data? Identifying Benign Data that Breaks Safety
by: He, Luxi, et al.
Published: (2024)
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
by: Ghosal, Soumya Suvra, et al.
Published: (2024)
Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
by: Fu, Shaopeng, et al.
Published: (2026)
AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning
by: Zhang, Yi, et al.
Published: (2025)
Explainable Machine Learning for Phishing Detection on Heterogeneous Datasets with MCP-Enabled Deployment
by: Dora, Nikhil Kumar, et al.
Published: (2026)
We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems
by: Li, Zhihao, et al.
Published: (2025)
What Does Normal Even Mean? Evaluating Benign Traffic in Intrusion Detection Datasets
by: Wilkinson, Meghan, et al.
Published: (2025)
Exploiting Defenses against GAN-Based Feature Inference Attacks in Federated Learning
by: Luo, Xinjian, et al.
Published: (2020)
Improved Generation of Adversarial Examples Against Safety-aligned LLMs
by: Li, Qizhang, et al.
Published: (2024)
Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry
by: Lan, Wenhao, et al.
Published: (2026)
Furina: Fragmented Uncertainty-Driven Refusal Instability Attack
by: Wu, Tongxi, et al.
Published: (2026)
Securing the Model Context Protocol (MCP): Risks, Controls, and Governance
by: Errico, Herman, et al.
Published: (2025)