:: Library Catalog

Bibliographic Details
Main Authors:	He, Xuanli; Sel, Bilgehan; Ali, Faizan; Bao, Jenny; Cunningham, Hoagy; Wei, Jerry
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Cryptography and Security
Online Access:	https://arxiv.org/abs/2604.14865

Similar Items

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning
by: Sel, Bilgehan, et al.
Published: (2026)

Mitigating Jailbreaks with Intent-Aware LLMs
by: Yeo, Wei Jie, et al.
Published: (2025)

HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment
by: Belkhiter, Yannis, et al.
Published: (2024)

TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning
by: He, Xuanli, et al.
Published: (2024)

Cut the Deadwood Out: Backdoor Purification via Guided Module Substitution
by: Tong, Yao, et al.
Published: (2024)

SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks
by: He, Xuanli, et al.
Published: (2024)

Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
by: Zhang, Chiyu, et al.
Published: (2025)

Deep Research Brings Deeper Harm
by: Chen, Shuo, et al.
Published: (2025)

One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs
by: Li, Linbao, et al.
Published: (2025)

Attacks on Third-Party APIs of Large Language Models
by: Zhao, Wanru, et al.
Published: (2024)

Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs
by: Chen, Yen-Shan, et al.
Published: (2026)

Defending against Backdoor Attacks via Module Switching
by: Li, Weijun, et al.
Published: (2025)

CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning
by: Yi, Biao, et al.
Published: (2025)

A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
by: Luo, Xuan, et al.
Published: (2025)

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models
by: Wei, Zhang, et al.
Published: (2025)

Decoupled Alignment for Robust Plug-and-Play Adaptation
by: Luo, Haozheng, et al.
Published: (2024)

Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts
by: Hastuti, Rochana Prih, et al.
Published: (2025)

Automatic Generation of Web Censorship Probe Lists
by: Tang, Jenny, et al.
Published: (2024)

Do Reasoning LLMs Refuse What They Infer in Long Contexts?
by: Fu, Yu, et al.
Published: (2026)

TOSSS: a CVE-based Software Security Benchmark for Large Language Models
by: Damie, Marc, et al.
Published: (2026)

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks
by: Lu, Guoxin, et al.
Published: (2026)

DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection
by: Yan, Yuliang, et al.
Published: (2025)

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
by: Chrabąszcz, Maciej, et al.
Published: (2026)

garak: A Framework for Security Probing Large Language Models
by: Derczynski, Leon, et al.
Published: (2024)

Waterfall: Framework for Robust and Scalable Text Watermarking and Provenance for LLMs
by: Lau, Gregory Kang Ruey, et al.
Published: (2024)

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
by: Shen, Xinjie, et al.
Published: (2026)

Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention
by: Prosser, Ellie, et al.
Published: (2024)

Enhance Robustness of Language Models Against Variation Attack through Graph Integration
by: Xiong, Zi, et al.
Published: (2024)

SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness
by: Huo, Jiahao, et al.
Published: (2026)

Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
by: Xing, Wenpeng, et al.
Published: (2025)

Token-Level Privacy in Large Language Models
by: Harel, Re'em, et al.
Published: (2025)

Fingerprinting LLMs via Prompt Injection
by: Hu, Yuepeng, et al.
Published: (2025)

LLMs for Domain Generation Algorithm Detection
by: La O, Reynier Leyva, et al.
Published: (2024)

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
by: Kabir, Md Rysul, et al.
Published: (2026)

GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection
by: Rad, Melissa Kazemi, et al.
Published: (2025)

SoK: Are Watermarks in LLMs Ready for Deployment?
by: Dang, Kieu, et al.
Published: (2025)

Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
by: Zeng, Xinyi, et al.
Published: (2024)

Fast-MIA: Efficient and Scalable Membership Inference for LLMs
by: Takahashi, Hiromu, et al.
Published: (2025)

AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing
by: Li, Yuexin, et al.
Published: (2026)

When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents
by: Jones, Jaylen, et al.
Published: (2026)