:: Library Catalog

Bibliographic Details
Main Authors:	Norelli, Antonio; Bronstein, Michael
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Computation and Language; Cryptography and Security; Machine Learning
Online Access:	https://arxiv.org/abs/2510.20075

Similar Items

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
by: Betley, Jan, et al.
Published: (2025)

LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
by: Lin, Shi, et al.
Published: (2024)

De-identification of clinical free text using natural language processing: A systematic review of current approaches
by: Kovačević, Aleksandar, et al.
Published: (2023)

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers
by: Dubiński, Jan, et al.
Published: (2026)

JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
by: Chu, Junjie, et al.
Published: (2024)

Jailbreaking LLMs via Calibration
by: Lu, Yuxuan, et al.
Published: (2026)

Tool Preferences in Agentic LLMs are Unreliable
by: Faghih, Kazem, et al.
Published: (2025)

Gandalf the Red: Adaptive Security for LLMs
by: Pfister, Niklas, et al.
Published: (2025)

Private prediction for large-scale synthetic text generation
by: Amin, Kareem, et al.
Published: (2024)

Shh, don't say that! Domain Certification in LLMs
by: Emde, Cornelius, et al.
Published: (2025)

Early Signs of Steganographic Capabilities in Frontier LLMs
by: Zolkowski, Artur, et al.
Published: (2025)

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
by: Mehrotra, Anay, et al.
Published: (2023)

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
by: Paulus, Anselm, et al.
Published: (2024)

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs
by: Betley, Jan, et al.
Published: (2025)

Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
by: Xu, Xiaoyu, et al.
Published: (2025)

Tell me about yourself: LLMs are aware of their learned behaviors
by: Betley, Jan, et al.
Published: (2025)

Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
by: Vega, Jason, et al.
Published: (2023)

Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment
by: Shao, Zedian, et al.
Published: (2024)

HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection
by: Wang, Yuxin, et al.
Published: (2024)

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
by: Rando, Javier, et al.
Published: (2024)

Learning to Diagnose Privately: DP-Powered LLMs for Radiology Report Classification
by: Bhattacharjee, Payel, et al.
Published: (2025)

Time Travel in LLMs: Tracing Data Contamination in Large Language Models
by: Golchin, Shahriar, et al.
Published: (2023)

MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
by: Kan, Chun Yan Ryan, et al.
Published: (2026)

Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
by: Poppi, Samuele, et al.
Published: (2024)

Teach LLMs to Phish: Stealing Private Information from Language Models
by: Panda, Ashwinee, et al.
Published: (2024)

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
by: Wang, Kai, et al.
Published: (2025)

KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
by: Liang, Buyun, et al.
Published: (2025)

PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization
by: Geng, Runpeng, et al.
Published: (2025)

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs
by: Guo, Chuan, et al.
Published: (2026)

Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
by: Hasan, Adib, et al.
Published: (2024)

How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis
by: Mostafa, Ahmed, et al.
Published: (2025)

FedMentor: Domain-Aware Differential Privacy for Heterogeneous Federated LLMs in Mental Health
by: Sarwar, Nobin, et al.
Published: (2025)

ObfuscaTune: Obfuscated Offsite Fine-tuning and Inference of Proprietary LLMs on Private Datasets
by: Frikha, Ahmed, et al.
Published: (2024)

Toward a Safer Web: Multilingual Multi-Agent LLMs for Mitigating Adversarial Misinformation Attacks
by: Aldahoul, Nouar, et al.
Published: (2025)

SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
by: Muhamed, Aashiq, et al.
Published: (2025)

LLMs Have Rhythm: Fingerprinting Large Language Models Using Inter-Token Times and Network Traffic Analysis
by: Alhazbi, Saeif, et al.
Published: (2025)

Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens
by: Salahuddin, Salahuddin, et al.
Published: (2025)

Rethinking How to Evaluate Language Model Jailbreak
by: Cai, Hongyu, et al.
Published: (2024)

In-Context Representation Hijacking
by: Yona, Itay, et al.
Published: (2025)

Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models
by: Chu, Junjie, et al.
Published: (2024)