| Main Authors: | Wei, Boyi, Huang, Kaixuan, Huang, Yangsibo, Xie, Tinghao, Qi, Xiangyu, Xia, Mengzhou, Mittal, Prateek, Wang, Mengdi, Henderson, Peter |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2402.05162 |
Similar Items
On Evaluating the Durability of Safeguards for Open-Weight LLMs
by: Qi, Xiangyu, et al.
Published: (2024)
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
by: Xie, Tinghao, et al.
Published: (2024)
An Adversarial Perspective on Machine Unlearning for AI Safety
by: Łucki, Jakub, et al.
Published: (2024)
AI Risk Management Should Incorporate Both Safety and Security
by: Qi, Xiangyu, et al.
Published: (2024)
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
by: Qi, Xiangyu, et al.
Published: (2024)
What is in Your Safe Data? Identifying Benign Data that Breaks Safety
by: He, Luxi, et al.
Published: (2024)
Fantastic Copyrighted Beasts and How (Not) to Generate Them
by: He, Luxi, et al.
Published: (2024)
Evaluating Copyright Takedown Methods for Language Models
by: Wei, Boyi, et al.
Published: (2024)
The Model Hears You: Audio Language Model Deployments Should Consider the Principle of Least Privilege
by: He, Luxi, et al.
Published: (2025)
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
by: Huang, Kaixuan, et al.
Published: (2024)
Averaging quadratically twisted $L$-values and their derivatives
by: Huang, Tinghao
Published: (2025)
Detecting Pretraining Data from Large Language Models
by: Shi, Weijia, et al.
Published: (2023)
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
by: Xia, Mengzhou, et al.
Published: (2023)
On Ramanujan Primes for Hecke-Maass Cusp Forms
by: Huang, Tinghao, et al.
Published: (2026)
A Theoretical Perspective for Speculative Decoding Algorithm
by: Yin, Ming, et al.
Published: (2024)
Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
by: Panda, Ashwinee, et al.
Published: (2024)
Adaptive and Stratified Subsampling for High-Dimensional Robust Estimation
by: Mittal, Prateek, et al.
Published: (2024)
Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation
by: Wang, Dianyun, et al.
Published: (2025)
Diffusion Model for Manifold Data: Score Decomposition, Curvature, and Statistical Complexity
by: Zhang, Zixuan, et al.
Published: (2026)
A Shear‐Deformable Extended Quasi‐Bond Method With Dual‐Mechanism Fracture Criterion for Brittle and Quasi‐Brittle Materials
by: Wei‐Tong Li, et al.
Published: (2025)
Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model
by: Lou, Xinyue, et al.
Published: (2025)
SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation
by: Li, Mingjie, et al.
Published: (2025)
Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization
by: Hong, Bin, et al.
Published: (2025)
Jointly Training and Pruning CNNs via Learnable Agent Guidance and Alignment
by: Ganjdanesh, Alireza, et al.
Published: (2024)
PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation
by: Benedek, Nadav, et al.
Published: (2024)
LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion
by: Zhou, Guanghao, et al.
Published: (2026)
Advancing LLM Safe Alignment with Safety Representation Ranking
by: Du, Tianqi, et al.
Published: (2025)
PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches
by: Jacob, Dennis, et al.
Published: (2025)
Beyond Images: Adaptive Fusion of Visual and Textual Data for Food Classification
by: Mittal, Prateek, et al.
Published: (2023)
Efficient Data Shapley for Weighted Nearest Neighbor Algorithms
by: Wang, Jiachen T., et al.
Published: (2024)
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
by: Ghosal, Soumya Suvra, et al.
Published: (2024)
Latent Diffusion Models for Controllable RNA Sequence Generation
by: Huang, Kaixuan, et al.
Published: (2024)
Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models
by: Yang, Yukang, et al.
Published: (2025)
Dynamic Risk Assessments for Offensive Cybersecurity Agents
by: Wei, Boyi, et al.
Published: (2025)
Embarrassment and the Social Dimensions of Moral Agency
by: Shawn Tinghao Wang
Published: (2026)
Enhancing Stability and Safety of Commercial Solid‐State Lithium Batteries Through Ternary Eutectic Solvents for Solid‐State Electrolyte Interface Modification
by: Kaixuan Zhou, et al.
Published: (2024)
When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models
by: Choi, Dasol, et al.
Published: (2026)
SimPO: Simple Preference Optimization with a Reference-Free Reward
by: Meng, Yu, et al.
Published: (2024)
Foundation Model Engineering: Engineering Foundation Models Just as Engineering Software
by: Ran, Dezhi, et al.
Published: (2024)
Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs
by: Ao, Shuang, et al.
Published: (2025)