| Main Authors: | Zhang, Yuqi, Ding, Liang, Zhang, Lefei, Tao, Dacheng |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2401.06561 |
Similar Items
AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders
by: Zhang, Yuqi, et al.
Published: (2025)
Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation
by: Cai, Shizhan, et al.
Published: (2025)
Defending LLMs against Jailbreaking Attacks via Backtranslation
by: Wang, Yihan, et al.
Published: (2024)
Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?
by: Atil, Berk, et al.
Published: (2025)
Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
by: Liu, Fan, et al.
Published: (2024)
Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs
by: Wu, Yuchen, et al.
Published: (2025)
From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks
by: Zhang, Zhexin, et al.
Published: (2024)
DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification
by: Li, Yu, et al.
Published: (2025)
Healthcare Copilot: Eliciting the Power of General LLMs for Medical Consultation
by: Ren, Zhiyao, et al.
Published: (2024)
The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check
by: Lu, Qingyu, et al.
Published: (2026)
Model Hemorrhage and the Robustness Limits of Large Language Models
by: Ma, Ziyang, et al.
Published: (2025)
MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators
by: Lu, Qingyu, et al.
Published: (2024)
Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models
by: Lu, Qingyu, et al.
Published: (2023)
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models
by: Gao, Lang, et al.
Published: (2024)
PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition
by: Zhang, Ziyang, et al.
Published: (2024)
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
by: Zhang, Zhexin, et al.
Published: (2023)
Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
by: Zhu, Junda, et al.
Published: (2025)
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
by: Ji, Jiabao, et al.
Published: (2024)
KaFT: Knowledge-aware Fine-tuning for Boosting LLMs' Domain-specific Question-Answering Performance
by: Zhong, Qihuang, et al.
Published: (2025)
The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
by: Miao, Yuchun, et al.
Published: (2025)
HSF: Defending against Jailbreak Attacks with Hidden State Filtering
by: Qian, Cheng, et al.
Published: (2024)
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
by: Zhao, Weixiang, et al.
Published: (2025)
Building Accurate Translation-Tailored LLMs with Language Aware Instruction Tuning
by: Zan, Changtong, et al.
Published: (2024)
Reason-KE++: Aligning the Process, Not Just the Outcome, for Faithful LLM Knowledge Editing
by: Wu, Yuchen, et al.
Published: (2025)
Robust Knowledge Editing via Explicit Reasoning Chains for Distractor-Resilient Multi-Hop QA
by: Wu, Yuchen, et al.
Published: (2025)
FusionBench: A Unified Library and Comprehensive Benchmark for Deep Model Fusion
by: Tang, Anke, et al.
Published: (2024)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models
by: Ying, Zonghao, et al.
Published: (2025)
WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge
by: Wang, Wenbin, et al.
Published: (2024)
Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks
by: Zhao, Jiawei, et al.
Published: (2024)
Uncertainty Aware Learning for Language Model Alignment
by: Wang, Yikun, et al.
Published: (2024)
Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
by: Li, Xiaoxia, et al.
Published: (2024)
Revisiting Catastrophic Forgetting in Large Language Model Tuning
by: Li, Hongyu, et al.
Published: (2024)
Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing
by: Zhao, Yinzhi, et al.
Published: (2026)
Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia
by: Shen, Guangyu, et al.
Published: (2024)
RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction
by: Jiang, Tanqiu, et al.
Published: (2024)
ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding
by: Zhong, Qihuang, et al.
Published: (2024)
E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation
by: Zhong, Qihuang, et al.
Published: (2022)
Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt
by: Peng, Keqin, et al.
Published: (2025)
PANDA: Prompt Transfer Meets Knowledge Distillation for Efficient Model Adaptation
by: Zhong, Qihuang, et al.
Published: (2022)
DB-LLM: Accurate Dual-Binarization for Efficient LLMs
by: Chen, Hong, et al.
Published: (2024)