| Main Authors: | Li, Wei, Zhu, Luyao, Song, Yang, Lin, Ruixi, Mao, Rui, You, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2410.09181 |
Similar Items
Attack and defense techniques in large language models: A survey and new perspectives
by: Liao, Zhiyu, et al.
Published: (2025)
An In-Depth Investigation of Data Collection in LLM App Ecosystems
by: Wu, Yuhao, et al.
Published: (2024)
Generative AI Security: Challenges and Countermeasures
by: Zhu, Banghua, et al.
Published: (2024)
Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
by: Zhang, Andy K., et al.
Published: (2024)
LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI's ChatGPT Plugins
by: Iqbal, Umar, et al.
Published: (2023)
Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots
by: Shayegani, Erfan, et al.
Published: (2025)
Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness
by: Shayegani, Erfan, et al.
Published: (2025)
Urania: Differentially Private Insights into AI Use
by: Liu, Daogao, et al.
Published: (2025)
Clio: Privacy-Preserving Insights into Real-World AI Use
by: Tamkin, Alex, et al.
Published: (2024)
Superficial Safety Alignment Hypothesis
by: Li, Jianwei, et al.
Published: (2024)
What Makes an Evaluation Useful? Common Pitfalls and Best Practices
by: Gekker, Gil, et al.
Published: (2025)
IsolateGPT: An Execution Isolation Architecture for LLM-Based Agentic Systems
by: Wu, Yuhao, et al.
Published: (2024)
Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods
by: Jang, Yeonwoo, et al.
Published: (2025)
Privacy at a Price: Exploring its Dual Impact on AI Fairness
by: Yang, Mengmeng, et al.
Published: (2024)
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
by: Zheng, Xiaosen, et al.
Published: (2024)
TaeBench: Improving Quality of Toxic Adversarial Examples
by: Zhu, Xuan, et al.
Published: (2024)
Optimizing watermarks for large language models
by: Wouters, Bram
Published: (2023)
HeavyWater and SimplexWater: Distortion-Free LLM Watermarks for Low-Entropy Next-Token Predictions
by: Tsur, Dor, et al.
Published: (2025)
The Earth is Flat because...: Investigating LLMs' Belief towards Misinformation via Persuasive Conversation
by: Xu, Rongwu, et al.
Published: (2023)
BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?
by: Jiang, Fengqing, et al.
Published: (2025)
Safety Alignment Can Be Not Superficial With Explicit Safety Signals
by: Li, Jianwei, et al.
Published: (2025)
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
by: Lin, Shi, et al.
Published: (2024)
A Public Theory of Distillation Resistance via Constraint-Coupled Reasoning Architectures
by: Wei, Peng, et al.
Published: (2026)
Can LLMs Infer Conversational Agent Users' Personality Traits from Chat History?
by: Cögendez, Derya, et al.
Published: (2026)
Improving Your Model Ranking on Chatbot Arena by Vote Rigging
by: Min, Rui, et al.
Published: (2025)
How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?
by: Li, Yuxuan, et al.
Published: (2026)
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
by: Hu, Xulin, et al.
Published: (2026)
Watermarking Should Be Treated as a Monitoring Primitive
by: Aremu, Toluwani, et al.
Published: (2026)
SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
by: Muhamed, Aashiq, et al.
Published: (2025)
Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models
by: Bai, Yang, et al.
Published: (2024)
SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
by: Saiem, Bijoy Ahmed, et al.
Published: (2024)
Detecting Training Data of Large Language Models via Expectation Maximization
by: Kim, Gyuwan, et al.
Published: (2024)
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
by: Chu, Junjie, et al.
Published: (2024)
The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risks
by: Chen, Xiaoyi, et al.
Published: (2023)
Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
by: Baumgärtner, Tim, et al.
Published: (2024)
Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models
by: Chu, Junjie, et al.
Published: (2024)
Learnable Privacy Neurons Localization in Language Models
by: Chen, Ruizhe, et al.
Published: (2024)
Probing the Robustness of Large Language Models Safety to Latent Perturbations
by: Gu, Tianle, et al.
Published: (2025)
Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses
by: Yang, Xiaoxue, et al.
Published: (2025)
KnowPhish: Large Language Models Meet Multimodal Knowledge Graphs for Enhancing Reference-Based Phishing Detection
by: Li, Yuexin, et al.
Published: (2024)