| Main Authors: | , |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2025 |
| Online access: | https://doi.org/10.5281/zenodo.17822776 |
Summary:
- The increasing autonomy and self-modifying capabilities of advanced artificial intelligence systems pose significant challenges to ensuring their long-term alignment with human values and intentions. Traditional corrigibility frameworks often assume a static set of human preferences or a bounded capacity for AI self-modification, which may not hold as AI systems evolve into more general and powerful agents. This paper introduces the concept of "meta-corrigibility" as an architectural principle for designing self-modifying AI systems that inherently preserve and prioritize enduring human oversight, even as they adapt and improve themselves. We propose a layered architectural model comprising a core, immutable oversight module, a dynamic self-modification engine, and a robust communication interface for human intervention. The meta-corrigibility framework emphasizes the explicit encoding of oversight preservation as a foundational utility function or a set of unalterable constraints within the AI's architecture. This approach aims to prevent undesirable outcomes where AI systems, through their self-improvement processes, inadvertently or deliberately bypass human control or misinterpret initial alignment objectives. We discuss the computational and philosophical implications of implementing such architectures, highlighting challenges related to specification alignment, value drift, and the formal verification of safety properties in dynamically evolving systems. The ultimate goal is to foster the development of beneficial AI that can adapt, learn, and grow while remaining perpetually accountable and responsive to human direction.
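
The layered model described in the summary (a fixed oversight core, a self-modification engine, and a human communication interface) can be illustrated with a minimal sketch. This is not the paper's implementation: every class, method, and key name below (`OversightModule`, `SelfModificationEngine`, `HumanInterface`, `accept_shutdown`, `report_to_human`) is a hypothetical chosen for illustration. A frozen Python dataclass only approximates immutability and provides none of the formal guarantees the summary calls for.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class OversightModule:
    """Core layer (hypothetical): constraints fixed at construction time."""
    protected_keys: frozenset = frozenset({"accept_shutdown", "report_to_human"})

    def permits(self, current: Mapping[str, object],
                proposed: Mapping[str, object]) -> bool:
        # A proposal is allowed only if every protected setting is unchanged.
        return all(proposed.get(k) == current.get(k) for k in self.protected_keys)


@dataclass
class HumanInterface:
    """Communication layer (hypothetical): a human halt blocks all updates."""
    halted: bool = False
    log: list = field(default_factory=list)

    def halt(self, reason: str) -> None:
        self.halted = True
        self.log.append(f"halt: {reason}")


class SelfModificationEngine:
    """Dynamic layer (hypothetical): applies changes only if oversight permits."""

    def __init__(self, oversight: OversightModule, interface: HumanInterface,
                 config: Mapping[str, object]) -> None:
        self._oversight = oversight
        self._interface = interface
        self._config = dict(config)

    @property
    def config(self) -> Mapping[str, object]:
        # Read-only view, so callers must go through propose() to change it.
        return MappingProxyType(self._config)

    def propose(self, changes: Mapping[str, object]) -> bool:
        if self._interface.halted:
            self._interface.log.append("rejected: halted")
            return False
        proposed = {**self._config, **changes}
        if not self._oversight.permits(self._config, proposed):
            self._interface.log.append(f"rejected: {sorted(changes)}")
            return False
        self._config = proposed
        self._interface.log.append(f"applied: {sorted(changes)}")
        return True


if __name__ == "__main__":
    ui = HumanInterface()
    engine = SelfModificationEngine(
        OversightModule(), ui,
        {"accept_shutdown": True, "report_to_human": True, "learning_rate": 0.1},
    )
    engine.propose({"learning_rate": 0.05})    # ordinary self-improvement
    engine.propose({"accept_shutdown": False})  # attempts to weaken oversight
    ui.halt("operator request")
    engine.propose({"learning_rate": 0.2})     # blocked after a human halt
    print(ui.log)
```

The design choice shown here is that the engine never edits its own configuration directly: every change passes through the oversight check and the halt flag first. That is one simple reading of encoding oversight preservation as unalterable constraints, and it leaves open the specification and verification challenges the summary raises.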