| Main Author: | Kumar, Sachin |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.19476 |
Similar Items
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models
by: Yuan, Xiaohan, et al.
Published: (2024)
PersonaMark: Personalized LLM watermarking for model protection and user attribution
by: Zhang, Yuehan, et al.
Published: (2024)
Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
by: Zhang, Junbo, et al.
Published: (2025)
SVIP: Towards Verifiable Inference of Open-source Large Language Models
by: Sun, Yifan, et al.
Published: (2024)
Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety
by: Zhang, Yuyou, et al.
Published: (2025)
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
by: Wang, Zijun, et al.
Published: (2026)
EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models
by: Wu, Jialin, et al.
Published: (2025)
Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution
by: Zhang, Xiaozhe, et al.
Published: (2026)
ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models
by: Zhao, Yunhan, et al.
Published: (2026)
Analysing the Safety Pitfalls of Steering Vectors
by: Li, Yuxiao, et al.
Published: (2026)
OverrideFuzz: Semantic-Aware Grammar Fuzzing for Script-Runtime Vulnerabilities
by: Qiu, Yiran
Published: (2026)
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
by: Shen, Guobin, et al.
Published: (2024)
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
by: Gu, Haoran, et al.
Published: (2025)
OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models
by: Wang, Thomas, et al.
Published: (2025)
Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models
by: Sakib, Shahnewaz Karim, et al.
Published: (2025)
Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
by: Yong, Zheng-Xin, et al.
Published: (2025)
Bag of Tricks for Subverting Reasoning-based Safety Guardrails
by: Chen, Shuo, et al.
Published: (2025)
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
by: Kumar, Anurakt, et al.
Published: (2024)
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
by: Zeng, Xinyi, et al.
Published: (2024)
Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
by: Fu, Yu, et al.
Published: (2024)
PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks
by: Shen, Guobin, et al.
Published: (2025)
RAG Safety: Exploring Knowledge Poisoning Attacks to Retrieval-Augmented Generation
by: Zhao, Tianzhe, et al.
Published: (2025)
Representation Bending for Large Language Model Safety
by: Yousefpour, Ashkan, et al.
Published: (2025)
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
by: Wang, Jiongxiao, et al.
Published: (2024)
TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts
by: Chu, Hua-Rong, et al.
Published: (2026)
One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety
by: Arif, Samee, et al.
Published: (2026)
Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
by: Chowdhury, Arijit Ghosh, et al.
Published: (2024)
GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis
by: Xie, Yueqi, et al.
Published: (2024)
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
by: Huang, Caishuang, et al.
Published: (2024)
Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak
by: Gu, Haoran, et al.
Published: (2026)
SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings
by: Lu, Weikai, et al.
Published: (2025)
Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
by: Vega, Jason, et al.
Published: (2023)
MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models
by: Xu, Chejian, et al.
Published: (2025)
Internal Safety Collapse in Frontier Large Language Models
by: Wu, Yutao, et al.
Published: (2026)
Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention
by: Singh, Himanshu, et al.
Published: (2026)
Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
by: Wang, Hao, et al.
Published: (2026)
BraveGuard: From Open-World Threats to Safer Computer-Use Agents
by: Feng, Yunhao, et al.
Published: (2026)
SGuard-v1: Safety Guardrail for Large Language Models
by: Lee, JoonHo, et al.
Published: (2025)
Mitigating Cyber Risk in the Age of Open-Weight LLMs: Policy Gaps and Technical Realities
by: de Gregorio, Alfonso
Published: (2025)
Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs
by: Wang, Jiawen, et al.
Published: (2025)