:: Library Catalog

Bibliographic Details
Main Author:	Kumar, Sachin
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Cryptography and Security
Online Access:	https://arxiv.org/abs/2409.19476

Similar Items

S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models
by: Yuan, Xiaohan, et al.
Published: (2024)

PersonaMark: Personalized LLM watermarking for model protection and user attribution
by: Zhang, Yuehan, et al.
Published: (2024)

Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
by: Zhang, Junbo, et al.
Published: (2025)

SVIP: Towards Verifiable Inference of Open-source Large Language Models
by: Sun, Yifan, et al.
Published: (2024)

Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety
by: Zhang, Yuyou, et al.
Published: (2025)

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
by: Wang, Zijun, et al.
Published: (2026)

EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models
by: Wu, Jialin, et al.
Published: (2025)

Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution
by: Zhang, Xiaozhe, et al.
Published: (2026)

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models
by: Zhao, Yunhan, et al.
Published: (2026)

Analysing the Safety Pitfalls of Steering Vectors
by: Li, Yuxiao, et al.
Published: (2026)

OverrideFuzz: Semantic-Aware Grammar Fuzzing for Script-Runtime Vulnerabilities
by: Qiu, Yiran
Published: (2026)

Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
by: Shen, Guobin, et al.
Published: (2024)

One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
by: Gu, Haoran, et al.
Published: (2025)

OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models
by: Wang, Thomas, et al.
Published: (2025)

Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models
by: Sakib, Shahnewaz Karim, et al.
Published: (2025)

Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
by: Yong, Zheng-Xin, et al.
Published: (2025)

Bag of Tricks for Subverting Reasoning-based Safety Guardrails
by: Chen, Shuo, et al.
Published: (2025)

SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
by: Kumar, Anurakt, et al.
Published: (2024)

Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
by: Zeng, Xinyi, et al.
Published: (2024)

Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
by: Fu, Yu, et al.
Published: (2024)

PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks
by: Shen, Guobin, et al.
Published: (2025)

RAG Safety: Exploring Knowledge Poisoning Attacks to Retrieval-Augmented Generation
by: Zhao, Tianzhe, et al.
Published: (2025)

Representation Bending for Large Language Model Safety
by: Yousefpour, Ashkan, et al.
Published: (2025)

Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
by: Wang, Jiongxiao, et al.
Published: (2024)

TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts
by: Chu, Hua-Rong, et al.
Published: (2026)

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety
by: Arif, Samee, et al.
Published: (2026)

Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
by: Chowdhury, Arijit Ghosh, et al.
Published: (2024)

GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis
by: Xie, Yueqi, et al.
Published: (2024)

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
by: Huang, Caishuang, et al.
Published: (2024)

Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak
by: Gu, Haoran, et al.
Published: (2026)

SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings
by: Lu, Weikai, et al.
Published: (2025)

Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
by: Vega, Jason, et al.
Published: (2023)

MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models
by: Xu, Chejian, et al.
Published: (2025)

Internal Safety Collapse in Frontier Large Language Models
by: Wu, Yutao, et al.
Published: (2026)

Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention
by: Singh, Himanshu, et al.
Published: (2026)

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
by: Wang, Hao, et al.
Published: (2026)

BraveGuard: From Open-World Threats to Safer Computer-Use Agents
by: Feng, Yunhao, et al.
Published: (2026)

SGuard-v1: Safety Guardrail for Large Language Models
by: Lee, JoonHo, et al.
Published: (2025)

Mitigating Cyber Risk in the Age of Open-Weight LLMs: Policy Gaps and Technical Realities
by: de Gregorio, Alfonso
Published: (2025)

Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs
by: Wang, Jiawen, et al.
Published: (2025)