:: Library Catalog

Bibliographic Details
Main Authors:	Li, Wei; Zhu, Luyao; Song, Yang; Lin, Ruixi; Mao, Rui; You, Yang
Format:	Preprint
Published:	2024
Subjects:	Cryptography and Security; Artificial Intelligence; Computation and Language; Computers and Society; Machine Learning
Online Access:	https://arxiv.org/abs/2410.09181

Similar Items

Attack and defense techniques in large language models: A survey and new perspectives
by: Liao, Zhiyu, et al.
Published: (2025)

An In-Depth Investigation of Data Collection in LLM App Ecosystems
by: Wu, Yuhao, et al.
Published: (2024)

Generative AI Security: Challenges and Countermeasures
by: Zhu, Banghua, et al.
Published: (2024)

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
by: Zhang, Andy K., et al.
Published: (2024)

LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI's ChatGPT Plugins
by: Iqbal, Umar, et al.
Published: (2023)

Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots
by: Shayegani, Erfan, et al.
Published: (2025)

Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness
by: Shayegani, Erfan, et al.
Published: (2025)

Urania: Differentially Private Insights into AI Use
by: Liu, Daogao, et al.
Published: (2025)

Clio: Privacy-Preserving Insights into Real-World AI Use
by: Tamkin, Alex, et al.
Published: (2024)

Superficial Safety Alignment Hypothesis
by: Li, Jianwei, et al.
Published: (2024)

What Makes an Evaluation Useful? Common Pitfalls and Best Practices
by: Gekker, Gil, et al.
Published: (2025)

IsolateGPT: An Execution Isolation Architecture for LLM-Based Agentic Systems
by: Wu, Yuhao, et al.
Published: (2024)

Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods
by: Jang, Yeonwoo, et al.
Published: (2025)

Privacy at a Price: Exploring its Dual Impact on AI Fairness
by: Yang, Mengmeng, et al.
Published: (2024)

Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
by: Zheng, Xiaosen, et al.
Published: (2024)

TaeBench: Improving Quality of Toxic Adversarial Examples
by: Zhu, Xuan, et al.
Published: (2024)

Optimizing watermarks for large language models
by: Wouters, Bram
Published: (2023)

HeavyWater and SimplexWater: Distortion-Free LLM Watermarks for Low-Entropy Next-Token Predictions
by: Tsur, Dor, et al.
Published: (2025)

The Earth is Flat because...: Investigating LLMs' Belief towards Misinformation via Persuasive Conversation
by: Xu, Rongwu, et al.
Published: (2023)

BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?
by: Jiang, Fengqing, et al.
Published: (2025)

Safety Alignment Can Be Not Superficial With Explicit Safety Signals
by: Li, Jianwei, et al.
Published: (2025)

LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models
by: Lin, Shi, et al.
Published: (2024)

A Public Theory of Distillation Resistance via Constraint-Coupled Reasoning Architectures
by: Wei, Peng, et al.
Published: (2026)

Can LLMs Infer Conversational Agent Users' Personality Traits from Chat History?
by: Cögendez, Derya, et al.
Published: (2026)

Improving Your Model Ranking on Chatbot Arena by Vote Rigging
by: Min, Rui, et al.
Published: (2025)

How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?
by: Li, Yuxuan, et al.
Published: (2026)

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
by: Hu, Xulin, et al.
Published: (2026)

Watermarking Should Be Treated as a Monitoring Primitive
by: Aremu, Toluwani, et al.
Published: (2026)

SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
by: Muhamed, Aashiq, et al.
Published: (2025)

Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models
by: Bai, Yang, et al.
Published: (2024)

SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
by: Saiem, Bijoy Ahmed, et al.
Published: (2024)

Detecting Training Data of Large Language Models via Expectation Maximization
by: Kim, Gyuwan, et al.
Published: (2024)

JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
by: Chu, Junjie, et al.
Published: (2024)

The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risks
by: Chen, Xiaoyi, et al.
Published: (2023)

Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
by: Baumgärtner, Tim, et al.
Published: (2024)

Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models
by: Chu, Junjie, et al.
Published: (2024)

Learnable Privacy Neurons Localization in Language Models
by: Chen, Ruizhe, et al.
Published: (2024)

Probing the Robustness of Large Language Models Safety to Latent Perturbations
by: Gu, Tianle, et al.
Published: (2025)

Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses
by: Yang, Xiaoxue, et al.
Published: (2025)

KnowPhish: Large Language Models Meet Multimodal Knowledge Graphs for Enhancing Reference-Based Phishing Detection
by: Li, Yuexin, et al.
Published: (2024)