| Main Authors: | Cao, Chentao, Xu, Xiaojun, Han, Bo, Li, Hang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.11629 |
Similar Items
Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
by: Zhang, Jiawei, et al.
Published: (2025)
Reasoning as an Adaptive Defense for Safety
by: Kim, Taeyoun, et al.
Published: (2025)
Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval
by: Chen, Taiye, et al.
Published: (2025)
Can LLM Safety Be Ensured by Constraining Parameter Regions?
by: Li, Zongmin, et al.
Published: (2026)
Ensuring Safety in an Uncertain Environment: Constrained MDPs via Stochastic Thresholds
by: Zuo, Qian, et al.
Published: (2025)
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
by: Ghosal, Soumya Suvra, et al.
Published: (2024)
Odysseus: Jailbreaking Commercial Multimodal LLM-integrated Systems via Dual Steganography
by: Li, Songze, et al.
Published: (2025)
Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable
by: Huang, Tiansheng, et al.
Published: (2025)
Adversarial Reasoning at Jailbreaking Time
by: Sabbaghi, Mahdi, et al.
Published: (2025)
A Causal Perspective for Enhancing Jailbreak Attack and Defense
by: Pan, Licheng, et al.
Published: (2026)
LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment
by: Zhang, Haonan, et al.
Published: (2026)
Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners
by: Pandey, Rohan, et al.
Published: (2026)
Plantain: Plan-Answer Interleaved Reasoning
by: Liang, Anthony, et al.
Published: (2025)
Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment
by: Chen, Tiejin, et al.
Published: (2026)
Incentivizing LLMs to Self-Verify Their Answers
by: Zhang, Fuxiang, et al.
Published: (2025)
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
by: Yi, Sibo, et al.
Published: (2024)
KnowGraph: Knowledge-Enabled Anomaly Detection via Logical Reasoning on Graph Data
by: Zhou, Andy, et al.
Published: (2024)
Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
by: Wang, Zehao, et al.
Published: (2026)
LocalGCL: Local-aware Contrastive Learning for Graphs
by: Jiang, Haojun, et al.
Published: (2024)
AlphaApollo: A System for Deep Agentic Reasoning
by: Zhou, Zhanke, et al.
Published: (2025)
Efficient Safety Retrofitting Against Jailbreaking for LLMs
by: Garcia-Gasulla, Dario, et al.
Published: (2025)
KnowHalu: Hallucination Detection via Multi-Form Knowledge Based Factual Checking
by: Zhang, Jiawei, et al.
Published: (2024)
Multilingual Safety Alignment via Self-Distillation
by: Qin, Ruiyang, et al.
Published: (2026)
CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal Reasoning
by: Wang, Yongxin, et al.
Published: (2025)
SoSBench: Benchmarking Safety Alignment on Six Scientific Domains
by: Jiang, Fengqing, et al.
Published: (2025)
Gradients as an Action: Towards Communication-Efficient Federated Recommender Systems via Adaptive Action Sharing
by: Lu, Zhufeng, et al.
Published: (2025)
Course-Correction: Safety Alignment Using Synthetic Preferences
by: Xu, Rongwu, et al.
Published: (2024)
Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills
by: Wang, Changsheng, et al.
Published: (2025)
Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following
by: Wang, Chenyang, et al.
Published: (2025)
Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron
by: Shen, Sicheng, et al.
Published: (2026)
REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak
by: Ma, Jiachen, et al.
Published: (2026)
Hindsight Hint Distillation: Scaffolded Reasoning for SWE Agents from CoT-free Answers
by: Wang, Shengjie, et al.
Published: (2026)
Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment
by: Li, Cheryl, et al.
Published: (2025)
Curriculum Learning for Safety Alignment
by: Kumar, Sandeep, et al.
Published: (2026)
Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense
by: Min, Rui, et al.
Published: (2024)
Beyond Alignment: Expanding Reasoning Capacity via Manifold-Reshaping Policy Optimization
by: Wang, Dayu, et al.
Published: (2026)
When to Trust the Cheap Check: Weak and Strong Verification for Reasoning
by: Kiyani, Shayan, et al.
Published: (2026)
Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning
by: Li, Xuan, et al.
Published: (2026)
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
by: Fares, Samar, et al.
Published: (2024)
AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models
by: Chen, Guangke, et al.
Published: (2025)