| Main Author: | Pan, Jonathan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.12286 |
Similar Items
In-Context Representation Hijacking
by: Yona, Itay, et al.
Published: (2025)
Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs
by: Kim, Jinhwa, et al.
Published: (2025)
NeuroFilter: Privacy Guardrails for Conversational LLM Agents
by: Das, Saswat, et al.
Published: (2026)
Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents
by: Ngong, Ivoline, et al.
Published: (2025)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models
by: Ying, Zonghao, et al.
Published: (2025)
Guarding Your Conversations: Privacy Gatekeepers for Secure Interactions with Cloud-Based AI Models
by: Uzor, GodsGift, et al.
Published: (2025)
Universal and Context-Independent Triggers for Precise Control of LLM Outputs
by: Liang, Jiashuo, et al.
Published: (2024)
Reverse-Engineering Model Editing on Language Models
by: Sun, Zhiyu, et al.
Published: (2026)
CATMark: A Context-Aware Thresholding Framework for Robust Cross-Task Watermarking in Large Language Models
by: Zhang, Yu, et al.
Published: (2025)
CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models
by: Zhou, Guanghao, et al.
Published: (2025)
REEF: Representation Encoding Fingerprints for Large Language Models
by: Zhang, Jie, et al.
Published: (2024)
Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts
by: Uenal, Fatih
Published: (2026)
CREBench: Evaluating Large Language Models in Cryptographic Binary Reverse Engineering
by: Chen, Baicheng, et al.
Published: (2026)
Mitigating the Safety-utility Trade-off in LLM Alignment via Adaptive Safe Context Learning
by: Wang, Yanbo, et al.
Published: (2026)
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
by: Xu, Huiyu, et al.
Published: (2024)
MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content
by: Guo, Ruoqi, et al.
Published: (2026)
Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
by: Xing, Wenpeng, et al.
Published: (2025)
Shadow Unlearning: A Neuro-Semantic Approach to Fidelity-Preserving Faceless Forgetting in LLMs
by: P, Dinesh Srivasthav, et al.
Published: (2026)
Direct Token Optimization: A Self-contained Approach to Large Language Model Unlearning
by: Lee, Hong kyu, et al.
Published: (2025)
Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs
by: D'addario, Andrew Maranhão Ventura
Published: (2025)
Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models
by: Chu, Junjie, et al.
Published: (2024)
AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
by: Lv, Lijia, et al.
Published: (2024)
Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning
by: Zhao, Shuai, et al.
Published: (2024)
A Survey of Recent Backdoor Attacks and Defenses in Large Language Models
by: Zhao, Shuai, et al.
Published: (2024)
DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning
by: Zhang, Junbo, et al.
Published: (2026)
PRISON: Unmasking the Criminal Potential of Large Language Models
by: Wu, Xinyi, et al.
Published: (2025)
Can LLMs Infer Conversational Agent Users' Personality Traits from Chat History?
by: Cögendez, Derya, et al.
Published: (2026)
The Earth is Flat because...: Investigating LLMs' Belief towards Misinformation via Persuasive Conversation
by: Xu, Rongwu, et al.
Published: (2023)
SWAN: Semantic Watermarking with Abstract Meaning Representation
by: Ye, Ziping, et al.
Published: (2026)
LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
by: Yu, Miao, et al.
Published: (2024)
Beyond Context: Large Language Models' Failure to Grasp Users' Intent
by: Hussain, Ahmed M., et al.
Published: (2025)
Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test
by: Zhu, Xiaoyuan, et al.
Published: (2025)
Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models
by: Tu, Shangqing, et al.
Published: (2024)
Adversarial Representation Engineering: A General Model Editing Framework for Large Language Models
by: Zhang, Yihao, et al.
Published: (2024)
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
by: Shen, Xinjie, et al.
Published: (2026)
Federated In-Context LLM Agent Learning
by: Wu, Panlong, et al.
Published: (2024)
Securing Multi-turn Conversational Language Models From Distributed Backdoor Triggers
by: Tong, Terry, et al.
Published: (2024)
How Private is Your Attention? Bridging Privacy with In-Context Learning
by: Bonnerjee, Soham, et al.
Published: (2025)
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
by: Wei, Zeming, et al.
Published: (2023)
Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval
by: Chen, Taiye, et al.
Published: (2025)