| Main Authors: | He, Xuanli, Sel, Bilgehan, Ali, Faizan, Bao, Jenny, Cunningham, Hoagy, Wei, Jerry |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.14865 |
Similar Items
Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning
by: Sel, Bilgehan, et al.
Published: (2026)
Mitigating Jailbreaks with Intent-Aware LLMs
by: Yeo, Wei Jie, et al.
Published: (2025)
HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment
by: Belkhiter, Yannis, et al.
Published: (2024)
TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning
by: He, Xuanli, et al.
Published: (2024)
Cut the Deadwood Out: Backdoor Purification via Guided Module Substitution
by: Tong, Yao, et al.
Published: (2024)
SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks
by: He, Xuanli, et al.
Published: (2024)
Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
by: Zhang, Chiyu, et al.
Published: (2025)
Deep Research Brings Deeper Harm
by: Chen, Shuo, et al.
Published: (2025)
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs
by: Li, Linbao, et al.
Published: (2025)
Attacks on Third-Party APIs of Large Language Models
by: Zhao, Wanru, et al.
Published: (2024)
Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs
by: Chen, Yen-Shan, et al.
Published: (2026)
Defending against Backdoor Attacks via Module Switching
by: Li, Weijun, et al.
Published: (2025)
CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning
by: Yi, Biao, et al.
Published: (2025)
A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
by: Luo, Xuan, et al.
Published: (2025)
Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models
by: Wei, Zhang, et al.
Published: (2025)
Decoupled Alignment for Robust Plug-and-Play Adaptation
by: Luo, Haozheng, et al.
Published: (2024)
Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts
by: Hastuti, Rochana Prih, et al.
Published: (2025)
Automatic Generation of Web Censorship Probe Lists
by: Tang, Jenny, et al.
Published: (2024)
Do Reasoning LLMs Refuse What They Infer in Long Contexts?
by: Fu, Yu, et al.
Published: (2026)
TOSSS: a CVE-based Software Security Benchmark for Large Language Models
by: Damie, Marc, et al.
Published: (2026)
Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks
by: Lu, Guoxin, et al.
Published: (2026)
DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection
by: Yan, Yuliang, et al.
Published: (2025)
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
by: Chrabąszcz, Maciej, et al.
Published: (2026)
garak: A Framework for Security Probing Large Language Models
by: Derczynski, Leon, et al.
Published: (2024)
Waterfall: Framework for Robust and Scalable Text Watermarking and Provenance for LLMs
by: Lau, Gregory Kang Ruey, et al.
Published: (2024)
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
by: Shen, Xinjie, et al.
Published: (2026)
Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention
by: Prosser, Ellie, et al.
Published: (2024)
Enhance Robustness of Language Models Against Variation Attack through Graph Integration
by: Xiong, Zi, et al.
Published: (2024)
SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness
by: Huo, Jiahao, et al.
Published: (2026)
Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
by: Xing, Wenpeng, et al.
Published: (2025)
Token-Level Privacy in Large Language Models
by: Harel, Re'em, et al.
Published: (2025)
Fingerprinting LLMs via Prompt Injection
by: Hu, Yuepeng, et al.
Published: (2025)
LLMs for Domain Generation Algorithm Detection
by: La O, Reynier Leyva, et al.
Published: (2024)
Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
by: Kabir, Md Rysul, et al.
Published: (2026)
GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection
by: Rad, Melissa Kazemi, et al.
Published: (2025)
SoK: Are Watermarks in LLMs Ready for Deployment?
by: Dang, Kieu, et al.
Published: (2025)
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
by: Zeng, Xinyi, et al.
Published: (2024)
Fast-MIA: Efficient and Scalable Membership Inference for LLMs
by: Takahashi, Hiromu, et al.
Published: (2025)
AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing
by: Li, Yuexin, et al.
Published: (2026)
When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents
by: Jones, Jaylen, et al.
Published: (2026)