| Main Authors: | Li, Yige; Feng, Yunhao; Sun, Jun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.27117 |
Similar Items
Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security
by: Zhao, Wei, et al.
Published: (2025)
Do Influence Functions Work on Large Language Models?
by: Li, Zhe, et al.
Published: (2024)
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
by: Zhao, Wei, et al.
Published: (2024)
Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing
by: Li, Zhe, et al.
Published: (2025)
Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models
by: Zhao, Wei, et al.
Published: (2024)
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models
by: Li, Yige, et al.
Published: (2024)
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
by: Feng, Yunhao, et al.
Published: (2026)
Position: Intelligent Science Laboratory Requires the Integration of Cognitive and Embodied AI
by: Zhang, Sha, et al.
Published: (2025)
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
by: Zhang, Jingyu, et al.
Published: (2024)
Position: Require Frontier AI Labs To Release Small "Analog" Models
by: Upadhyay, Shriyash, et al.
Published: (2025)
Position: Human-Centric AI Requires a Minimum Viable Level of Human Understanding
by: Lin, Fangzhou, et al.
Published: (2026)
Position: Embodied AI Requires a Privacy-Utility Trade-off
by: Fan, Xiaoliang, et al.
Published: (2026)
Zero-Shot Defense Against Toxic Images via Inherent Multimodal Alignment in LVLMs
by: Zhao, Wei, et al.
Published: (2025)
Position: AI Safety Must Embrace an Antifragile Perspective
by: Jin, Ming, et al.
Published: (2025)
BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents
by: Feng, Yunhao, et al.
Published: (2026)
CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization
by: Min, Nay Myat, et al.
Published: (2024)
Adaptive Content Restriction for Large Language Models via Suffix Optimization
by: Li, Yige, et al.
Published: (2025)
AnyAttack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models
by: Zhang, Jiaming, et al.
Published: (2024)
Dataset Safety in Autonomous Driving: Requirements, Risks, and Assurance
by: Abbaspour, Alireza, et al.
Published: (2025)
Building Effective Safety Guardrails in AI Education Tools
by: Clark, Hannah-Beth, et al.
Published: (2025)
Negative as Positive: Enhancing Out-of-distribution Generalization for Graph Contrastive Learning
by: Wang, Zixu, et al.
Published: (2024)
Why AI Safety Requires Uncertainty, Incomplete Preferences, and Non-Archimedean Utilities
by: Benavoli, Alessio, et al.
Published: (2025)
Foundational Analysis of Safety Engineering Requirements (SAFER)
by: Chemo, Noga, et al.
Published: (2026)
Engineering Safety Requirements for Autonomous Driving with Large Language Models
by: Nouri, Ali, et al.
Published: (2024)
Position: Agentic AI System Is a Foreseeable Pathway to AGI
by: Liao, Junwei, et al.
Published: (2026)
AutoBackdoor: Automating Backdoor Attacks via LLM Agents
by: Li, Yige, et al.
Published: (2025)
Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation
by: Li, Jianwei, et al.
Published: (2026)
Position: Safety and Fairness in Agentic AI Depend on Interaction Topology, Not on Model Scale or Alignment
by: Bajaj, Tanav Singh, et al.
Published: (2026)
Defining Explainable AI for Requirements Analysis
by: Sheh, Raymond, et al.
Published: (2026)
Engaging with AI: How Interface Design Shapes Human-AI Collaboration in High-Stakes Decision-Making
by: Chen, Zichen, et al.
Published: (2025)
NeurIPS Should Require Reproducibility Standards for Frontier AI Safety Claims
by: Vishwarupe, Varad, et al.
Published: (2026)
Responsible Agentic AI Requires Explicit Provenance
by: Hu, Jinwei, et al.
Published: (2026)
LinSATNet: The Positive Linear Satisfiability Neural Networks
by: Wang, Runzhong, et al.
Published: (2024)
Position: State-of-the-Art Claims Require State-of-the-Art Evidence
by: Oh, YongKyung
Published: (2026)
AIR: Improving Agent Safety through Incident Response
by: Xiao, Zibo, et al.
Published: (2026)
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
by: Griffin, Charlie, et al.
Published: (2024)
AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection
by: Luo, Weidi, et al.
Published: (2025)
A New Perspective On AI Safety Through Control Theory Methodologies
by: Ullrich, Lars, et al.
Published: (2025)
Internal Safety Collapse in Frontier Large Language Models
by: Wu, Yutao, et al.
Published: (2026)
AI2-Active Safety: AI-enabled Interaction-aware Active Safety Analysis with Vehicle Dynamics
by: Wu, Keshu, et al.
Published: (2025)