:: Library Catalog

Bibliographic Details
Main Authors:	Cui, Shiyao; Zhang, Zhenyu; Chen, Yilong; Zhang, Wenyuan; Liu, Tianyun; Wang, Siqi; Liu, Tingwen
Format:	Preprint
Published:	2023
Subjects:	Computation and Language; Cryptography and Security
Online Access:	https://arxiv.org/abs/2311.18580

Similar Items

Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
by: Xie, Yuanbo, et al.
Published: (2025)

T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
by: Li, Lijun, et al.
Published: (2025)

Safety Alignment Should Be Made More Than Just A Few Attention Heads
by: Huang, Chao, et al.
Published: (2025)

Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
by: Xing, Wenpeng, et al.
Published: (2025)

Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts
by: Hastuti, Rochana Prih, et al.
Published: (2025)

Detecting RAG Extraction Attack via Dual-Path Runtime Integrity Game
by: Xie, Yuanbo, et al.
Published: (2026)

Jailbreaking LLMs via Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge
by: Xu, Ning, et al.
Published: (2025)

S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models
by: Yuan, Xiaohan, et al.
Published: (2024)

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs
by: Fastowski, Alina, et al.
Published: (2025)

From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks
by: Zhang, Zhexin, et al.
Published: (2024)

The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs
by: Liu, Songyang, et al.
Published: (2025)

How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference
by: Liu, Songyang, et al.
Published: (2026)

Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
by: Ying, Zonghao, et al.
Published: (2025)

Toward Copyright Integrity and Verifiability via Multi-Bit Watermarking for Intelligent Transportation Systems
by: Wang, Yihao, et al.
Published: (2025)

PMark: Towards Robust and Distortion-free Semantic-level Watermarking with Channel Constraints
by: Huo, Jiahao, et al.
Published: (2025)

Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
by: Zhang, Chiyu, et al.
Published: (2025)

From Compression to Accountability: Harmless Copyright Protection for Dataset Distillation
by: Liang, Yan, et al.
Published: (2026)

Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models
by: Sakib, Shahnewaz Karim, et al.
Published: (2025)

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
by: Zhang, Rui, et al.
Published: (2026)

Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
by: Yang, Junxiao, et al.
Published: (2025)

Efficient Detection of Toxic Prompts in Large Language Models
by: Liu, Yi, et al.
Published: (2024)

from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
by: Yan, Yu, et al.
Published: (2025)

Towards Building a Robust Toxicity Predictor
by: Bespalov, Dmitriy, et al.
Published: (2024)

Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking
by: Gu, Tianle, et al.
Published: (2025)

PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
by: Zhu, Kaijie, et al.
Published: (2023)

Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment
by: Li, Jie, et al.
Published: (2024)

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
by: Wei, Jiali, et al.
Published: (2026)

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak
by: Gu, Haoran, et al.
Published: (2026)

Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs
by: Liu, Yepeng, et al.
Published: (2025)

Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare
by: Zhang, Hang, et al.
Published: (2025)

TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs
by: Rajore, Tanmay, et al.
Published: (2024)

More Haste, Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles
by: Chen, Ruibo, et al.
Published: (2026)

Semantic-Preserving Adversarial Attacks on LLMs: An Adaptive Greedy Binary Search Approach
by: Zhang, Chong, et al.
Published: (2025)

Lightweight Yet Secure: Secure Scripting Language Generation via Lightweight LLMs
by: Zhang, Keyang, et al.
Published: (2026)

FreqMark: Frequency-Based Watermark for Sentence-Level Detection of LLM-Generated Text
by: Xu, Zhenyu, et al.
Published: (2024)

Towards Label-Only Membership Inference Attack against Pre-trained Large Language Models
by: He, Yu, et al.
Published: (2025)

The Model's Language Matters: A Comparative Privacy Analysis of LLMs
by: Mishra, Abhishek K., et al.
Published: (2025)

Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs
by: Li, Xiang, et al.
Published: (2025)

Auditing Data Membership in Reinforcement Learning With Verifiable Rewards
by: Liu, Yule, et al.
Published: (2025)

PREE: Towards Harmless and Adaptive Fingerprint Editing in Large Language Models via Knowledge Prefix Enhancement
by: Yue, Xubin, et al.
Published: (2025)