| Main Authors: | Zhang, Hanxiu, Zheng, Yue |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.03620 |
Similar Items
Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation
by: Chen, Sixu, et al.
Published: (2026)
BinaryShield: Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints
by: Gill, Waris, et al.
Published: (2025)
Instructional Fingerprinting of Large Language Models
by: Xu, Jiashu, et al.
Published: (2024)
Are Robust LLM Fingerprints Adversarially Robust?
by: Nasery, Anshul, et al.
Published: (2025)
A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens
by: Dobre, David, et al.
Published: (2025)
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
by: Cao, Bochuan, et al.
Published: (2023)
Tuning without Peeking: Provable Generalization Bounds and Robust LLM Post-Training
by: Labiad, Ismail, et al.
Published: (2025)
Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses
by: Ahmed, Mohamed, et al.
Published: (2025)
LLMs Have Rhythm: Fingerprinting Large Language Models Using Inter-Token Times and Network Traffic Analysis
by: Alhazbi, Saeif, et al.
Published: (2025)
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
by: Zheng, Xiaosen, et al.
Published: (2024)
A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment
by: Wang, Kun, et al.
Published: (2025)
BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents
by: Wang, Yifei, et al.
Published: (2024)
Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
by: Zhang, Haoyu, et al.
Published: (2026)
SVIP: Towards Verifiable Inference of Open-source Large Language Models
by: Sun, Yifan, et al.
Published: (2024)
Linking Cryptoasset Attribution Tags to Knowledge Graph Entities: An LLM-based Approach
by: Avice, Régnier, et al.
Published: (2025)
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
by: Wu, Tong, et al.
Published: (2024)
PostMark: A Robust Blackbox Watermark for Large Language Models
by: Chang, Yapei, et al.
Published: (2024)
Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models
by: Yuan, Hongbang, et al.
Published: (2024)
AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing
by: Li, Yuexin, et al.
Published: (2026)
From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning
by: Xu, Xiaoyu, et al.
Published: (2026)
Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening
by: Zhang, Mohan, et al.
Published: (2026)
Towards Understanding the Robustness of Sparse Autoencoders
by: Saiyed, Ahson, et al.
Published: (2026)
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
by: Xu, Xiaoyu, et al.
Published: (2025)
STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
by: Li, Jing-Jing, et al.
Published: (2025)
Federated In-Context LLM Agent Learning
by: Wu, Panlong, et al.
Published: (2024)
Exploring the Robustness of In-Context Learning with Noisy Labels
by: Cheng, Chen, et al.
Published: (2024)
Towards Building a Robust Toxicity Predictor
by: Bespalov, Dmitriy, et al.
Published: (2024)
On Adversarial Robustness of Language Models in Transfer Learning
by: Turbal, Bohdan, et al.
Published: (2024)
AdvPrefix: An Objective for Nuanced LLM Jailbreaks
by: Zhu, Sicheng, et al.
Published: (2024)
Policy-Invisible Violations in LLM-Based Agents
by: Wu, Jie, et al.
Published: (2026)
Certifying LLM Safety against Adversarial Prompting
by: Kumar, Aounon, et al.
Published: (2023)
GaussMark: A Practical Approach for Structural Watermarking of Language Models
by: Block, Adam, et al.
Published: (2025)
Directional Embedding Smoothing for Robust Vision Language Models
by: Wang, Ye, et al.
Published: (2026)
Adversarial Text Purification: A Large Language Model Approach for Defense
by: Moraffah, Raha, et al.
Published: (2024)
HSF: Defending against Jailbreak Attacks with Hidden State Filtering
by: Qian, Cheng, et al.
Published: (2024)
Efficient LLM Moderation with Multi-Layer Latent Prototypes
by: Chrabąszcz, Maciej, et al.
Published: (2025)
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
by: Halawi, Danny, et al.
Published: (2024)
Adaptive Instruction Composition for Automated LLM Red-Teaming
by: Zymet, Jesse, et al.
Published: (2026)
OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models
by: Xu, Xiaoyu, et al.
Published: (2025)
Probing the Robustness of Large Language Models Safety to Latent Perturbations
by: Gu, Tianle, et al.
Published: (2025)