| Main Authors: | Zhang, Xiaoyun, Zhao, Zhengyue, Shi, Wenxuan, Xu, Kaidi, Huang, Di, Hu, Xing |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2406.16743 |
Similar Items
Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching
by: Zhao, Weixiang, et al.
Published: (2024)
Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack
by: Huang, Tiansheng, et al.
Published: (2024)
Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT
by: Yu, Le, et al.
Published: (2025)
Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
by: Huang, Tiansheng, et al.
Published: (2024)
MPO: Multilingual Safety Alignment via Reward Gap Optimization
by: Zhao, Weixiang, et al.
Published: (2025)
Language of Thought Shapes Output Diversity in Large Language Models
by: Xu, Shaoyang, et al.
Published: (2026)
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
by: Banerjee, Somnath, et al.
Published: (2024)
Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model
by: Lou, Xinyue, et al.
Published: (2025)
Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model
by: Wang, Siyin, et al.
Published: (2024)
SafeLawBench: Towards Safe Alignment of Large Language Models
by: Cao, Chuxue, et al.
Published: (2025)
ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models
by: Zhang, Hengxiang, et al.
Published: (2024)
SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks
by: Li, Tianhao, et al.
Published: (2024)
Contrastive Knowledge Transfer and Robust Optimization for Secure Alignment of Large Language Models
by: Zheng, Jiasen, et al.
Published: (2025)
SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations
by: Huang, Shuai, et al.
Published: (2025)
SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
by: Djuhera, Aladin, et al.
Published: (2025)
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
by: Huang, Caishuang, et al.
Published: (2024)
All Languages Matter: On the Multilingual Safety of Large Language Models
by: Wang, Wenxuan, et al.
Published: (2023)
SafeWorld: Geo-Diverse Safety Alignment
by: Yin, Da, et al.
Published: (2024)
Improving the Robustness of Large Language Models via Consistency Alignment
by: Zhao, Yukun, et al.
Published: (2024)
Pardon? Evaluating Conversational Repair in Large Audio-Language Models
by: Huang, Shuanghong, et al.
Published: (2026)
DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
by: Hu, Wenhao, et al.
Published: (2025)
IUQ: Interrogative Uncertainty Quantification for Long-Form Large Language Model Generation
by: Fan, Haozhi, et al.
Published: (2026)
Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation
by: Zhang, Zhibo, et al.
Published: (2025)
From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment
by: Huang, Chongxuan, et al.
Published: (2025)
NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning
by: Yi, Xin, et al.
Published: (2024)
SConU: Selective Conformal Uncertainty in Large Language Models
by: Wang, Zhiyuan, et al.
Published: (2025)
VaccineRAG: Boosting Multimodal Large Language Models' Immunity to Harmful RAG Samples
by: Sun, Qixin, et al.
Published: (2025)
AdaCultureSafe: Adaptive Cultural Safety Grounded by Cultural Knowledge in Large Language Models
by: Kang, Hankun, et al.
Published: (2026)
Enhancing Safety of Large Language Models via Embedding Space Separation
by: Zhao, Xu, et al.
Published: (2026)
Advancing LLM Safe Alignment with Safety Representation Ranking
by: Du, Tianqi, et al.
Published: (2025)
One-Shot Safety Alignment for Large Language Models via Optimal Dualization
by: Huang, Xinmeng, et al.
Published: (2024)
Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling
by: Deng, Qiyuan, et al.
Published: (2025)
Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding
by: Xu, Derong, et al.
Published: (2024)
Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings
by: Huang, Yue, et al.
Published: (2024)
SafetyBench: Evaluating the Safety of Large Language Models
by: Zhang, Zhexin, et al.
Published: (2023)
Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning
by: Zhang, Zongmeng, et al.
Published: (2024)
Self-HarmLLM: Can Large Language Model Harm Itself?
by: Kim, Heehwan, et al.
Published: (2025)
On Almost Surely Safe Alignment of Large Language Models at Inference-Time
by: Ji, Xiaotong, et al.
Published: (2025)
ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models
by: Chen, Hao, et al.
Published: (2025)
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
by: Yuan, Youliang, et al.
Published: (2023)