| Main Authors: | Cui, Shiyao, Zhang, Zhenyu, Chen, Yilong, Zhang, Wenyuan, Liu, Tianyun, Wang, Siqi, Liu, Tingwen |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2311.18580 |
Similar Items
Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
by: Xie, Yuanbo, et al.
Published: (2025)
T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
by: Li, Lijun, et al.
Published: (2025)
Safety Alignment Should Be Made More Than Just A Few Attention Heads
by: Huang, Chao, et al.
Published: (2025)
Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
by: Xing, Wenpeng, et al.
Published: (2025)
Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts
by: Hastuti, Rochana Prih, et al.
Published: (2025)
Detecting RAG Extraction Attack via Dual-Path Runtime Integrity Game
by: Xie, Yuanbo, et al.
Published: (2026)
Jailbreaking LLMs via Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge
by: Xu, Ning, et al.
Published: (2025)
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models
by: Yuan, Xiaohan, et al.
Published: (2024)
Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs
by: Fastowski, Alina, et al.
Published: (2025)
From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks
by: Zhang, Zhexin, et al.
Published: (2024)
The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs
by: Liu, Songyang, et al.
Published: (2025)
How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference
by: Liu, Songyang, et al.
Published: (2026)
Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
by: Ying, Zonghao, et al.
Published: (2025)
Toward Copyright Integrity and Verifiability via Multi-Bit Watermarking for Intelligent Transportation Systems
by: Wang, Yihao, et al.
Published: (2025)
PMark: Towards Robust and Distortion-free Semantic-level Watermarking with Channel Constraints
by: Huo, Jiahao, et al.
Published: (2025)
Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
by: Zhang, Chiyu, et al.
Published: (2025)
From Compression to Accountability: Harmless Copyright Protection for Dataset Distillation
by: Liang, Yan, et al.
Published: (2026)
Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models
by: Sakib, Shahnewaz Karim, et al.
Published: (2025)
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
by: Zhang, Rui, et al.
Published: (2026)
Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
by: Yang, Junxiao, et al.
Published: (2025)
Efficient Detection of Toxic Prompts in Large Language Models
by: Liu, Yi, et al.
Published: (2024)
from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
by: Yan, Yu, et al.
Published: (2025)
Towards Building a Robust Toxicity Predictor
by: Bespalov, Dmitriy, et al.
Published: (2024)
Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking
by: Gu, Tianle, et al.
Published: (2025)
PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
by: Zhu, Kaijie, et al.
Published: (2023)
Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment
by: Li, Jie, et al.
Published: (2024)
Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
by: Wei, Jiali, et al.
Published: (2026)
Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak
by: Gu, Haoran, et al.
Published: (2026)
Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs
by: Liu, Yepeng, et al.
Published: (2025)
Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare
by: Zhang, Hang, et al.
Published: (2025)
TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs
by: Rajore, Tanmay, et al.
Published: (2024)
More Haste, Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles
by: Chen, Ruibo, et al.
Published: (2026)
Semantic-Preserving Adversarial Attacks on LLMs: An Adaptive Greedy Binary Search Approach
by: Zhang, Chong, et al.
Published: (2025)
Lightweight Yet Secure: Secure Scripting Language Generation via Lightweight LLMs
by: Zhang, Keyang, et al.
Published: (2026)
FreqMark: Frequency-Based Watermark for Sentence-Level Detection of LLM-Generated Text
by: Xu, Zhenyu, et al.
Published: (2024)
Towards Label-Only Membership Inference Attack against Pre-trained Large Language Models
by: He, Yu, et al.
Published: (2025)
The Model's Language Matters: A Comparative Privacy Analysis of LLMs
by: Mishra, Abhishek K., et al.
Published: (2025)
Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs
by: Li, Xiang, et al.
Published: (2025)
Auditing Data Membership in Reinforcement Learning With Verifiable Rewards
by: Liu, Yule, et al.
Published: (2025)
PREE: Towards Harmless and Adaptive Fingerprint Editing in Large Language Models via Knowledge Prefix Enhancement
by: Yue, Xubin, et al.
Published: (2025)