| Main Authors: | Dung, Leonard, Mai, Florian |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.11235 |
Similar Items
Failures in Perspective-taking of Multimodal AI Systems
by: Leonard, Bridget, et al.
Published: (2024)
Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare
by: Tagliabue, Valen, et al.
Published: (2025)
Containment Verification: AI Safety Guarantees Independent of Alignment
by: Moon, Royce, et al.
Published: (2026)
Mechanistic Interpretability for AI Safety -- A Review
by: Bereska, Leonard, et al.
Published: (2024)
"We are not Future-ready": Understanding AI Privacy Risks and Existing Mitigation Strategies from the Perspective of AI Developers in Europe
by: Klymenko, Alexandra, et al.
Published: (2025)
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
by: Yin, Bo, et al.
Published: (2026)
Against racing to AGI: Cooperation, deterrence, and catastrophic risks
by: Dung, Leonard, et al.
Published: (2025)
Misalignment or misuse? The AGI alignment tradeoff
by: Hellrigel-Holderbaum, Max, et al.
Published: (2025)
An Adversarial Perspective on Machine Unlearning for AI Safety
by: Łucki, Jakub, et al.
Published: (2024)
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
by: Chen, Jianhui, et al.
Published: (2024)
Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors
by: Wiedermann-Möller, Jonas, et al.
Published: (2026)
Low-Resource Safety Failures Are Action Failures, Not Representation Failures
by: Aziz, Rashad, et al.
Published: (2026)
Preference Learning for AI Alignment: a Causal Perspective
by: Kobalczyk, Katarzyna, et al.
Published: (2025)
Are we Doomed to an AI Race? Why Self-Interest Could Drive Countries Towards a Moratorium on Superintelligence
by: Roussel, Edward, et al.
Published: (2026)
Position: AI Safety Must Embrace an Antifragile Perspective
by: Jin, Ming, et al.
Published: (2025)
Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges
by: Lu, Haoran, et al.
Published: (2025)
Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective
by: Osooli, Hamid, et al.
Published: (2026)
Subtle Risks, Critical Failures: A Framework for Diagnosing Physical Safety of LLMs for Embodied Decision Making
by: Son, Yejin, et al.
Published: (2025)
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
by: Li, Xiaomin, et al.
Published: (2026)
Rationalize: Shared Semantic Reasoning for Human-AI Alignment
by: Dasgupta, Aritra, et al.
Published: (2026)
Seeing Eye to Eye: Enabling Cognitive Alignment Through Shared First-Person Perspective in Human-AI Collaboration
by: Teng, Zhuyu, et al.
Published: (2026)
Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications
by: David, Isaac, et al.
Published: (2026)
HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions
by: Zhou, Xuhui, et al.
Published: (2024)
Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives
by: Zeng, Wei, et al.
Published: (2025)
SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment
by: Fang, Xianya, et al.
Published: (2026)
A New Perspective On AI Safety Through Control Theory Methodologies
by: Ullrich, Lars, et al.
Published: (2025)
Risk Alignment in Agentic AI Systems
by: Clatterbuck, Hayley, et al.
Published: (2024)
Implicit Safety Alignment from Crowd Preferences
by: Lin, Qian, et al.
Published: (2026)
Exploring Privacy and Fairness Risks in Sharing Diffusion Models: An Adversarial Perspective
by: Luo, Xinjian, et al.
Published: (2024)
Position: Safety and Fairness in Agentic AI Depend on Interaction Topology, Not on Model Scale or Alignment
by: Bajaj, Tanav Singh, et al.
Published: (2026)
Whose Truth? Pluralistic Geo-Alignment for (Agentic) AI
by: Janowicz, Krzysztof, et al.
Published: (2025)
Confirmation Bias in Generative AI Chatbots: Mechanisms, Risks, Mitigation Strategies, and Future Research Directions
by: Du, Yiran
Published: (2025)
Independence Is Not an Issue in Neurosymbolic AI
by: Faronius, Håkan Karlsson, et al.
Published: (2025)
Computational Safety for Generative AI: A Signal Processing Perspective
by: Chen, Pin-Yu
Published: (2025)
Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation
by: Sun, Chenkai, et al.
Published: (2025)
AI and Human Oversight: A Risk-Based Framework for Alignment
by: Kandikatla, Laxmiraju, et al.
Published: (2025)
A Technological Perspective on Misuse of Available AI
by: Pöhler, Lukas, et al.
Published: (2024)
Dialogical Reasoning Across AI Architectures: A Multi-Model Framework for Testing AI Alignment Strategies
by: Cox, Gray
Published: (2026)
AI Risk Management Should Incorporate Both Safety and Security
by: Qi, Xiangyu, et al.
Published: (2024)
Wide Reflective Equilibrium in LLM Alignment: Bridging Moral Epistemology and AI Safety
by: Brophy, Matthew
Published: (2025)