Bibliographic Details
Main Authors: Malomgré, Elias; Simoens, Pieter
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2602.14844
Abstract:
  • AI alignment is growing in importance, yet many current approaches learn safety behavior by directly modifying policy parameters, entangling normative constraints with the underlying policy. This often yields opaque, difficult-to-edit alignment artifacts and reduces their reuse across models or deployments, a failure mode we term Alignment Waste. We propose Interactionless Inverse Reinforcement Learning, a framework for learning inspectable, editable, and reusable reward artifacts separately from policy optimization. We further introduce the Alignment Flywheel, a human-in-the-loop lifecycle for iteratively auditing, patching, and hardening these artifacts through automated evaluation and refinement. Together, these ideas recast alignment from a disposable training expense into a durable, verifiable engineering asset.