| Main Authors: | Zhang, Yichi, Zhang, Siyuan, Huang, Yao, Xia, Zeyu, Fang, Zhengwei, Yang, Xiao, Duan, Ranjie, Yan, Dong, Dong, Yinpeng, Zhu, Jun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.02384 |
Similar Items
Reasoning as State Transition: A Representational Analysis of Reasoning Evolution in Large Language Models
by: Zhang, Siyuan, et al.
Published: (2026)
Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
by: Zhang, Yichi, et al.
Published: (2025)
MESA: Improving MoE Safety Alignment via Decentralized Expertise
by: Sun, Yitong, et al.
Published: (2026)
Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization
by: Zhang, Siyuan, et al.
Published: (2025)
RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability
by: Zhang, Yichi, et al.
Published: (2025)
Mitigating Overthinking in Large Reasoning Models via Manifold Steering
by: Huang, Yao, et al.
Published: (2025)
Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation
by: Zhang, Yichi, et al.
Published: (2025)
DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios
by: Huang, Yao, et al.
Published: (2025)
MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models
by: Zhang, Yichi, et al.
Published: (2024)
Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space
by: Huang, Yao, et al.
Published: (2025)
Evil Geniuses: Delving into the Safety of LLM-based Agents
by: Tian, Yu, et al.
Published: (2023)
Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
by: Zhang, Yichi, et al.
Published: (2026)
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models
by: Zhang, Yichi, et al.
Published: (2024)
Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models
by: Duan, Ranjie, et al.
Published: (2025)
T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
by: Miao, Yibo, et al.
Published: (2024)
CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention
by: Hu, Xiaomeng, et al.
Published: (2025)
RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations
by: Li, Hanyu, et al.
Published: (2026)
Rethinking Model Ensemble in Transfer-based Adversarial Attacks
by: Chen, Huanran, et al.
Published: (2023)
Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection
by: Sun, Guanglong, et al.
Published: (2026)
Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models
by: Ding, Xuan, et al.
Published: (2026)
BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators
by: Tian, Yu, et al.
Published: (2024)
STAR-PólyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision
by: Wu, Jiaao, et al.
Published: (2026)
Scaling Laws for Black box Adversarial Attacks
by: Liu, Chuan, et al.
Published: (2024)
STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering
by: Wang, Yueqian, et al.
Published: (2024)
MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments
by: Yang, Xiao, et al.
Published: (2025)
DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models
by: Liu, Jianyu, et al.
Published: (2025)
A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents
by: Su, Hang, et al.
Published: (2025)
Unveiling the Basin-Like Loss Landscape in Large Language Models
by: Chen, Huanran, et al.
Published: (2025)
Language Models Fail to Introspect About Their Knowledge of Language
by: Song, Siyuan, et al.
Published: (2025)
Privileged Self-Access Matters for Introspection in AI
by: Song, Siyuan, et al.
Published: (2025)
Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion
by: Zhao, Dong, et al.
Published: (2025)
Enhancing Reasoning Abilities of Small LLMs with Cognitive Alignment
by: Cai, Wenrui, et al.
Published: (2025)
Measuring Iterative Temporal Reasoning with Time Puzzles
by: Wang, Zhengxiang, et al.
Published: (2026)
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack
by: Li, Hao, et al.
Published: (2026)
STAR-S: Improving Safety Alignment through Self-Taught Reasoning on Safety Rules
by: Wu, Di, et al.
Published: (2026)
Recursive Introspection: Teaching Language Model Agents How to Self-Improve
by: Qu, Yuxiao, et al.
Published: (2024)
Improving Safety Alignment via Balanced Direct Preference Optimization
by: Zhao, Shiji, et al.
Published: (2026)
Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
by: Fu, Yu, et al.
Published: (2023)
Learning to Evolve: Bayesian-Guided Continual Knowledge Graph Embedding
by: Li, Linyu, et al.
Published: (2025)
Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering
by: Zhang, Yichi, et al.
Published: (2023)