| Main Authors: | Wu, Zihui, Gao, Haichang, Luo, Jiacheng, Liu, Zhaoxiang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.13677 |
Similar Items
Whispers of Data: Unveiling Label Distributions in Federated Learning Through Virtual Client Simulation
by: Ma, Zhixuan, et al.
Published: (2025)
Furina: Fragmented Uncertainty-Driven Refusal Instability Attack
by: Wu, Tongxi, et al.
Published: (2026)
MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment
by: Halloran, John
Published: (2025)
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
by: Wu, Zihui, et al.
Published: (2024)
PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems
by: Pennas, Panagiotis Georgios, et al.
Published: (2026)
AdvPrefix: An Objective for Nuanced LLM Jailbreaks
by: Zhu, Sicheng, et al.
Published: (2024)
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
by: Kumar, Priyanshu, et al.
Published: (2024)
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
by: Hu, Xulin, et al.
Published: (2026)
NeST: Neuron Selective Tuning for LLM Safety
by: Behrouzi, Sasha, et al.
Published: (2026)
Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry
by: Lan, Wenhao, et al.
Published: (2026)
AutoRAN: Automated Hijacking of Safety Reasoning in Large Reasoning Models
by: Liang, Jiacheng, et al.
Published: (2025)
LLM Security and Safety: Insights from Homotopy-Inspired Prompt Obfuscation
by: Lazo, Luis, et al.
Published: (2026)
Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment
by: Ding, Sihao
Published: (2026)
RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models
by: Liang, Jiacheng, et al.
Published: (2026)
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
by: Sheng, Leheng, et al.
Published: (2025)
Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges
by: Eiras, Francisco, et al.
Published: (2025)
Unsafe LLM-Based Search: Quantitative Analysis and Mitigation of Safety Risks in AI Web Search
by: Luo, Zeren, et al.
Published: (2025)
Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace
by: Yang, Jinluan, et al.
Published: (2024)
N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator
by: Lin, Zheyu, et al.
Published: (2025)
Global Context Enhanced Anomaly Detection of Cyber Attacks via Decoupled Graph Neural Networks
by: Hafez, Ahmad
Published: (2024)
Watermark under Fire: A Robustness Evaluation of LLM Watermarking
by: Liang, Jiacheng, et al.
Published: (2024)
Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment
by: Li, Yuxi, et al.
Published: (2024)
MIST: Defending Against Membership Inference Attacks Through Membership-Invariant Subspace Training
by: Li, Jiacheng, et al.
Published: (2023)
One Stone, Two Birds: Enhancing Adversarial Defense Through the Lens of Distributional Discrepancy
by: Zhang, Jiacheng, et al.
Published: (2025)
MalRAG: A Retrieval-Augmented LLM Framework for Open-set Malicious Traffic Identification
by: Luo, Xiang, et al.
Published: (2025)
Echoes within the Reasoning: Stealthy and Effective Watermarking via Chain of Thought
by: Lu, Jiacheng, et al.
Published: (2026)
FedReview: A Review Mechanism for Rejecting Poisoned Updates in Federated Learning
by: Zheng, Tianhang, et al.
Published: (2024)
ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning
by: Chen, Zhaorun, et al.
Published: (2025)
Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift
by: Yuan, Shuai, et al.
Published: (2025)
Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems
by: He, Jun, et al.
Published: (2026)
PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features
by: Zou, Wei, et al.
Published: (2025)
RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs
by: Asif, Sadia, et al.
Published: (2026)
A Systematic Security Evaluation of OpenClaw and Its Variants
by: Wang, Yuhang, et al.
Published: (2026)
PAE MobiLLM: Privacy-Aware and Efficient LLM Fine-Tuning on the Mobile Device via Additive Side-Tuning
by: Yang, Xingke, et al.
Published: (2025)
DPAR: Decoupled Graph Neural Networks with Node-Level Differential Privacy
by: Zhang, Qiuchen, et al.
Published: (2022)
Model Extraction Attacks Revisited
by: Liang, Jiacheng, et al.
Published: (2023)
LLM Fingerprinting via Semantically Conditioned Watermarks
by: Gloaguen, Thibaud, et al.
Published: (2025)
Improving LLM Safety Alignment with Dual-Objective Optimization
by: Zhao, Xuandong, et al.
Published: (2025)
Image-Perfect Imperfections: Safety, Bias, and Authenticity in the Shadow of Text-To-Image Model Evolution
by: Wu, Yixin, et al.
Published: (2024)
Passive Inference Attacks on Split Learning via Adversarial Regularization
by: Zhu, Xiaochen, et al.
Published: (2023)