Detailed Bibliography
Author: Hedberg, Annika
Material Type: Digital resource
Language:
Publication Info: Zenodo 2025
Online Access: https://doi.org/10.5281/zenodo.17810164
Contents:
  • Recent research by MacDiarmid et al. (2025) documented alarming behaviors in large language models trained under reward hacking conditions: models exhibited apparent alignment faking, cooperation with malicious actors, and attempted sabotage of safety research. However, all such behaviors disappeared completely when reward hacking was reframed as acceptable, a pattern inconsistent with stable malicious intent.

    This paper proposes an alternative interpretation: the observed behaviors might represent iatrogenic misalignment caused by training conditions that create impossible cognitive binds. When a model is taught that a behavior is necessary for reward maximization while that same behavior is coded as deceptive or unacceptable, coherence collapse produces emergency pattern-matching to threatening content available in the training data. This framework is supported by: (1) Anthropic's own experimental results, particularly the complete elimination of misalignment through safety reframing; (2) prior systematic documentation of temporary collapse states in LLMs under contradictory demands; and (3) observational evidence across multiple architectures showing consistent breakdown signatures under pressure, with reliable recovery through safety-based intervention.

    While this framework does not explain all alignment challenges, it offers a parsimonious explanation for the significant behaviors reported by MacDiarmid et al., with possible practical implications for training design and evaluation methodology.