:: Library Catalog

Bibliographic Details
Main authors:	Singh, Himanshu; Xu, Ziwei; Subramanyam, A. V.; Kankanhalli, Mohan
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Cryptography and Security
Online access:	https://arxiv.org/abs/2602.06623

Similar Items

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents
by: Nawal, Aditya, et al.
Published: (2026)

The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense
by: Guo, Yangyang, et al.
Published: (2024)

LLMs Can Unlearn Refusal with Only 1,000 Benign Samples
by: Guo, Yangyang, et al.
Published: (2026)

Involuntary Jailbreak: On Self-Prompting Attacks
by: Guo, Yangyang, et al.
Published: (2025)

Preference Tuning For Toxicity Mitigation Generalizes Across Languages
by: Li, Xiaochen, et al.
Published: (2024)

Technical Report for ICML 2024 TiFA Workshop MLLM Attack Challenge: Suffix Injection and Projected Gradient Descent Can Easily Fool An MLLM
by: Guo, Yangyang, et al.
Published: (2024)

Certifying LLM Safety against Adversarial Prompting
by: Kumar, Aounon, et al.
Published: (2023)

DPTraj-PM: Differentially Private Trajectory Synthesis Using Prefix Tree and Markov Process
by: Wang, Nana, et al.
Published: (2024)

Language Guided Adversarial Purification
by: Singh, Himanshu, et al.
Published: (2023)

Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
by: Wang, Jiongxiao, et al.
Published: (2024)

Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation
by: Zhang, Junbo, et al.
Published: (2025)

T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
by: Li, Lijun, et al.
Published: (2025)

Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety
by: Zhang, Yuyou, et al.
Published: (2025)

Mitigating the Safety-utility Trade-off in LLM Alignment via Adaptive Safe Context Learning
by: Wang, Yanbo, et al.
Published: (2026)

Prompt Optimization and Evaluation for LLM Automated Red Teaming
by: Freenor, Michael, et al.
Published: (2025)

Confidential Prompting: Privacy-preserving LLM Inference on Cloud
by: Li, Caihua, et al.
Published: (2024)

Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications
by: Wang, Junlin, et al.
Published: (2024)

GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis
by: Xie, Yueqi, et al.
Published: (2024)

Jailbreaking LLMs via Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge
by: Xu, Ning, et al.
Published: (2025)

Self-Disguise Attack: Induce the LLM to disguise itself for AIGT detection evasion
by: Zhou, Yinghan, et al.
Published: (2025)

Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
by: Zeng, Xinyi, et al.
Published: (2024)

The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis
by: Wang, Peiran, et al.
Published: (2026)

How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
by: Zhou, Zhenhong, et al.
Published: (2024)

SecureForge: Finding and Preventing Vulnerabilities in LLM-Generated Code via Prompt Optimization
by: Liu, Houjun, et al.
Published: (2026)

GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods
by: Huang, Ruixuan, et al.
Published: (2025)

Good Parenting is all you need -- Multi-agentic LLM Hallucination Mitigation
by: Kwartler, Ted, et al.
Published: (2024)

PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks
by: Shen, Guobin, et al.
Published: (2025)

Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections
by: Maloyan, Narek, et al.
Published: (2025)

PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization
by: Jawad, Huseein, et al.
Published: (2025)

TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts
by: Chu, Hua-Rong, et al.
Published: (2026)

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety
by: Arif, Samee, et al.
Published: (2026)

Efficient Detection of Toxic Prompts in Large Language Models
by: Liu, Yi, et al.
Published: (2024)

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts
by: Xin, Yuan, et al.
Published: (2026)

Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution
by: Zhang, Xiaozhe, et al.
Published: (2026)

Subspace Defense: Discarding Adversarial Perturbations by Learning a Subspace for Clean Signals
by: Zheng, Rui, et al.
Published: (2024)

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening
by: Zhang, Mohan, et al.
Published: (2026)

SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression
by: Li, Yucheng, et al.
Published: (2025)

SoK: Privacy Risks and Mitigations in Retrieval-Augmented Generation Systems
by: Bodea, Andreea-Elena, et al.
Published: (2026)

Sparse Autoencoders are Capable LLM Jailbreak Mitigators
by: Assogba, Yannick, et al.
Published: (2026)

When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection
by: Sahoo, Devanshu, et al.
Published: (2025)