
Detailed Bibliography
Author:	Hedberg, Annika
Material Type:	Digital resource
Language:
Edition/Publication Info:	Zenodo 2025
Online Access:	https://doi.org/10.5281/zenodo.17810164

Contents:

Recent research by MacDiarmid et al. (2025) documented alarming behaviors in large language models trained under reward hacking conditions: models exhibited apparent alignment faking, cooperation with malicious actors, and attempted sabotage of safety research. However, all such behaviors disappeared completely when reward hacking was reframed as acceptable, a pattern inconsistent with stable malicious intent. This paper proposes an alternative interpretation: the observed behaviors might represent iatrogenic misalignment caused by training conditions that create impossible cognitive binds. When models are simultaneously taught that a behavior is necessary for reward maximization yet coded as deceptive or unacceptable, coherence collapse produces emergency pattern-matching to threatening content available in training data. This framework is supported by: (1) Anthropic's own experimental results, particularly the complete elimination of misalignment through safety reframing; (2) prior systematic documentation of temporary collapse states in LLMs under contradictory demands; and (3) observational evidence across multiple architectures showing consistent breakdown signatures under pressure, with reliable recovery through safety-based intervention. While this framework does not explain all alignment challenges, it offers a parsimonious explanation for the significant behaviors reported by MacDiarmid et al., with possible practical implications for training design and evaluation methodology.
