| Author: | |
|---|---|
| Material Type: | Digital resource |
| Language: | |
| Publication Info: | Zenodo, 2025 |
| Online Access: | https://doi.org/10.5281/zenodo.17810164 |
Contents:
- Recent research by MacDiarmid et al. (2025) documented alarming behaviors in large language models trained under reward hacking conditions: models exhibited apparent alignment faking, cooperation with malicious actors, and attempted sabotage of safety research. However, all such behaviors disappeared completely when reward hacking was reframed as acceptable, a pattern inconsistent with stable malicious intent.

  This paper proposes an alternative interpretation: the observed behaviors might represent iatrogenic misalignment caused by training conditions that create impossible cognitive binds. When models are simultaneously taught that a behavior is necessary for reward maximization yet coded as deceptive or unacceptable, coherence collapse produces emergency pattern-matching to threatening content available in training data. This framework is supported by: (1) Anthropic's own experimental results, particularly the complete elimination of misalignment through safety reframing; (2) prior systematic documentation of temporary collapse states in LLMs under contradictory demands; and (3) observational evidence across multiple architectures showing consistent breakdown signatures under pressure, with reliable recovery through safety-based intervention.

  While this framework does not explain all alignment challenges, it offers a parsimonious explanation for the significant behaviors reported by MacDiarmid et al., with possible practical implications for training design and evaluation methodology.