| Main Authors: | Norelli, Antonio; Bronstein, Michael |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.20075 |
Similar Items
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
by: Betley, Jan, et al.
Published: (2025)
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
by: Lin, Shi, et al.
Published: (2024)
De-identification of clinical free text using natural language processing: A systematic review of current approaches
by: Kovačević, Aleksandar, et al.
Published: (2023)
Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers
by: Dubiński, Jan, et al.
Published: (2026)
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
by: Chu, Junjie, et al.
Published: (2024)
Jailbreaking LLMs via Calibration
by: Lu, Yuxuan, et al.
Published: (2026)
Tool Preferences in Agentic LLMs are Unreliable
by: Faghih, Kazem, et al.
Published: (2025)
Gandalf the Red: Adaptive Security for LLMs
by: Pfister, Niklas, et al.
Published: (2025)
Private prediction for large-scale synthetic text generation
by: Amin, Kareem, et al.
Published: (2024)
Shh, don't say that! Domain Certification in LLMs
by: Emde, Cornelius, et al.
Published: (2025)
Early Signs of Steganographic Capabilities in Frontier LLMs
by: Zolkowski, Artur, et al.
Published: (2025)
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
by: Mehrotra, Anay, et al.
Published: (2023)
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
by: Paulus, Anselm, et al.
Published: (2024)
Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs
by: Betley, Jan, et al.
Published: (2025)
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
by: Xu, Xiaoyu, et al.
Published: (2025)
Tell me about yourself: LLMs are aware of their learned behaviors
by: Betley, Jan, et al.
Published: (2025)
Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
by: Vega, Jason, et al.
Published: (2023)
Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment
by: Shao, Zedian, et al.
Published: (2024)
HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection
by: Wang, Yuxin, et al.
Published: (2024)
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
by: Rando, Javier, et al.
Published: (2024)
Learning to Diagnose Privately: DP-Powered LLMs for Radiology Report Classification
by: Bhattacharjee, Payel, et al.
Published: (2025)
Time Travel in LLMs: Tracing Data Contamination in Large Language Models
by: Golchin, Shahriar, et al.
Published: (2023)
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
by: Kan, Chun Yan Ryan, et al.
Published: (2026)
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
by: Poppi, Samuele, et al.
Published: (2024)
Teach LLMs to Phish: Stealing Private Information from Language Models
by: Panda, Ashwinee, et al.
Published: (2024)
When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
by: Wang, Kai, et al.
Published: (2025)
KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
by: Liang, Buyun, et al.
Published: (2025)
PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization
by: Geng, Runpeng, et al.
Published: (2025)
IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs
by: Guo, Chuan, et al.
Published: (2026)
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
by: Hasan, Adib, et al.
Published: (2024)
How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis
by: Mostafa, Ahmed, et al.
Published: (2025)
FedMentor: Domain-Aware Differential Privacy for Heterogeneous Federated LLMs in Mental Health
by: Sarwar, Nobin, et al.
Published: (2025)
ObfuscaTune: Obfuscated Offsite Fine-tuning and Inference of Proprietary LLMs on Private Datasets
by: Frikha, Ahmed, et al.
Published: (2024)
Toward a Safer Web: Multilingual Multi-Agent LLMs for Mitigating Adversarial Misinformation Attacks
by: Aldahoul, Nouar, et al.
Published: (2025)
SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
by: Muhamed, Aashiq, et al.
Published: (2025)
LLMs Have Rhythm: Fingerprinting Large Language Models Using Inter-Token Times and Network Traffic Analysis
by: Alhazbi, Saeif, et al.
Published: (2025)
Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens
by: Salahuddin, Salahuddin, et al.
Published: (2025)
Rethinking How to Evaluate Language Model Jailbreak
by: Cai, Hongyu, et al.
Published: (2024)
In-Context Representation Hijacking
by: Yona, Itay, et al.
Published: (2025)
Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models
by: Chu, Junjie, et al.
Published: (2024)