Library Catalog

Bibliographic Details
Main Authors:	Dung, Leonard; Mai, Florian
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.11235

Similar Items

Failures in Perspective-taking of Multimodal AI Systems
by: Leonard, Bridget, et al.
Published: (2024)

Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare
by: Tagliabue, Valen, et al.
Published: (2025)

Containment Verification: AI Safety Guarantees Independent of Alignment
by: Moon, Royce, et al.
Published: (2026)

Mechanistic Interpretability for AI Safety -- A Review
by: Bereska, Leonard, et al.
Published: (2024)

"We are not Future-ready": Understanding AI Privacy Risks and Existing Mitigation Strategies from the Perspective of AI Developers in Europe
by: Klymenko, Alexandra, et al.
Published: (2025)

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
by: Yin, Bo, et al.
Published: (2026)

Against racing to AGI: Cooperation, deterrence, and catastrophic risks
by: Dung, Leonard, et al.
Published: (2025)

Misalignment or misuse? The AGI alignment tradeoff
by: Hellrigel-Holderbaum, Max, et al.
Published: (2025)

An Adversarial Perspective on Machine Unlearning for AI Safety
by: Łucki, Jakub, et al.
Published: (2024)

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
by: Chen, Jianhui, et al.
Published: (2024)

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors
by: Wiedermann-Möller, Jonas, et al.
Published: (2026)

Low-Resource Safety Failures Are Action Failures, Not Representation Failures
by: Aziz, Rashad, et al.
Published: (2026)

Preference Learning for AI Alignment: a Causal Perspective
by: Kobalczyk, Katarzyna, et al.
Published: (2025)

Are we Doomed to an AI Race? Why Self-Interest Could Drive Countries Towards a Moratorium on Superintelligence
by: Roussel, Edward, et al.
Published: (2026)

Position: AI Safety Must Embrace an Antifragile Perspective
by: Jin, Ming, et al.
Published: (2025)

Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges
by: Lu, Haoran, et al.
Published: (2025)

Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective
by: Osooli, Hamid, et al.
Published: (2026)

Subtle Risks, Critical Failures: A Framework for Diagnosing Physical Safety of LLMs for Embodied Decision Making
by: Son, Yejin, et al.
Published: (2025)

Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
by: Li, Xiaomin, et al.
Published: (2026)

Rationalize: Shared Semantic Reasoning for Human-AI Alignment
by: Dasgupta, Aritra, et al.
Published: (2026)

Seeing Eye to Eye: Enabling Cognitive Alignment Through Shared First-Person Perspective in Human-AI Collaboration
by: Teng, Zhuyu, et al.
Published: (2026)

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications
by: David, Isaac, et al.
Published: (2026)

HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions
by: Zhou, Xuhui, et al.
Published: (2024)

Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives
by: Zeng, Wei, et al.
Published: (2025)

SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment
by: Fang, Xianya, et al.
Published: (2026)

A New Perspective On AI Safety Through Control Theory Methodologies
by: Ullrich, Lars, et al.
Published: (2025)

Risk Alignment in Agentic AI Systems
by: Clatterbuck, Hayley, et al.
Published: (2024)

Implicit Safety Alignment from Crowd Preferences
by: Lin, Qian, et al.
Published: (2026)

Exploring Privacy and Fairness Risks in Sharing Diffusion Models: An Adversarial Perspective
by: Luo, Xinjian, et al.
Published: (2024)

Position: Safety and Fairness in Agentic AI Depend on Interaction Topology, Not on Model Scale or Alignment
by: Bajaj, Tanav Singh, et al.
Published: (2026)

Whose Truth? Pluralistic Geo-Alignment for (Agentic) AI
by: Janowicz, Krzysztof, et al.
Published: (2025)

Confirmation Bias in Generative AI Chatbots: Mechanisms, Risks, Mitigation Strategies, and Future Research Directions
by: Du, Yiran
Published: (2025)

Independence Is Not an Issue in Neurosymbolic AI
by: Faronius, Håkan Karlsson, et al.
Published: (2025)

Computational Safety for Generative AI: A Signal Processing Perspective
by: Chen, Pin-Yu
Published: (2025)

Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation
by: Sun, Chenkai, et al.
Published: (2025)

AI and Human Oversight: A Risk-Based Framework for Alignment
by: Kandikatla, Laxmiraju, et al.
Published: (2025)

A Technological Perspective on Misuse of Available AI
by: Pöhler, Lukas, et al.
Published: (2024)

Dialogical Reasoning Across AI Architectures: A Multi-Model Framework for Testing AI Alignment Strategies
by: Cox, Gray
Published: (2026)

AI Risk Management Should Incorporate Both Safety and Security
by: Qi, Xiangyu, et al.
Published: (2024)

Wide Reflective Equilibrium in LLM Alignment: Bridging Moral Epistemology and AI Safety
by: Brophy, Matthew
Published: (2025)