| Main Authors: | Li, Hao, Li, Lijun, Lu, Zhenghao, Wei, Xianyi, Li, Rui, Shao, Jing, Sha, Lei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2507.18631 |
Similar Items
Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
by: Wang, Hao, et al.
Published: (2026)
HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment
by: Liu, Yuexiao, et al.
Published: (2025)
Collaborative Shadows: Distributed Backdoor Attacks in LLM-Based Multi-Agent Systems
by: Zhu, Pengyu, et al.
Published: (2025)
Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization
by: Li, Xurui, et al.
Published: (2025)
Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models
by: Ding, Yi, et al.
Published: (2025)
Purified and Unified Steganographic Network
by: Li, Guobiao, et al.
Published: (2024)
Safety Layers in Aligned Large Language Models: The Key to LLM Security
by: Li, Shen, et al.
Published: (2024)
Safety Context Injection: Inference-Time Safety Alignment via Static Filtering and Agentic Analysis
by: Xu, Zhenhao, et al.
Published: (2026)
Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios
by: Qu, Jingen, et al.
Published: (2025)
Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities
by: Xiong, Yuan, et al.
Published: (2025)
GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners
by: Li, Haoran, et al.
Published: (2025)
Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment
by: Li, Yuxi, et al.
Published: (2024)
Agent Safety Alignment via Reinforcement Learning
by: Sha, Zeyang, et al.
Published: (2025)
When Safety Becomes a Vulnerability: Exploiting LLM Alignment Homogeneity for Transferable Blocking in RAG
by: Li, Junchen, et al.
Published: (2026)
PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
by: Zhang, Zaibin, et al.
Published: (2024)
Reimagining Safety Alignment with An Image
by: Xia, Yifan, et al.
Published: (2025)
GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms
by: He, Sinan, et al.
Published: (2025)
FedTDP: A Privacy-Preserving and Unified Framework for Trajectory Data Preparation via Federated Learning
by: Zeng, Zhihao, et al.
Published: (2025)
DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning
by: Zhang, Junbo, et al.
Published: (2026)
Towards Privacy-Preserving Range Queries with Secure Learned Spatial Index over Encrypted Data
by: Wang, Zuan, et al.
Published: (2025)
USCSA: Evolution-Aware Security Analysis for Proxy-Based Upgradeable Smart Contracts
by: Li, Xiaoqi, et al.
Published: (2025)
T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
by: Li, Lijun, et al.
Published: (2025)
Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference
by: Li, Jianwei, et al.
Published: (2026)
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
by: Guo, Weiyang, et al.
Published: (2025)
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
by: Wang, Jiongxiao, et al.
Published: (2024)
Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training
by: Weng, Fenghua, et al.
Published: (2025)
Rethinking Fraud Safety Evaluation: Multi-Round Attacks Reveal Safety-Utility Tradeoffs in Graph-Context LLM Defenders
by: Jiang, Laura, et al.
Published: (2026)
DPBloomfilter: Securing Bloom Filters with Differential Privacy
by: Ke, Yekun, et al.
Published: (2025)
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
by: Miao, Ziqi, et al.
Published: (2025)
VisuoAlign: Safety Alignment of LVLMs with Multimodal Tree Search
by: Li, MingSheng, et al.
Published: (2025)
RL-Finetuned LLMs for Privacy-Preserving Synthetic Rewriting
by: Shi, Zhan, et al.
Published: (2025)
OpenClaw PRISM: A Zero-Fork, Defense-in-Depth Runtime Security Layer for Tool-Augmented LLM Agents
by: Li, Frank
Published: (2026)
FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment
by: Kuznetsov, Daniel, et al.
Published: (2026)
Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs
by: Davies, Xander, et al.
Published: (2025)
"No Matter What You Do": Purifying GNN Models via Backdoor Unlearning
by: Zhang, Jiale, et al.
Published: (2024)
CodePurify: Defend Backdoor Attacks on Neural Code Models via Entropy-based Purification
by: Mu, Fangwen, et al.
Published: (2024)
TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations
by: Chen, Xiuyuan, et al.
Published: (2025)
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
by: Li, Lijun, et al.
Published: (2024)
RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
by: Wei, Zeming, et al.
Published: (2026)
Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models
by: Tan, Rui Yang, et al.
Published: (2026)