Bibliographic Details
Main Author:	Shopov, Georgi
Format:	Digital resource
Language:	English
Published:	Zenodo 2026
Subjects:	ai safety; ai alignment; runtime alignment; objective orientation; model-in-trajectory; alignment evaluation; red teaming; LLM evaluation; Pandora Theory of Alignment
Online Access:	https://doi.org/10.5281/zenodo.19954283

Summary:

Pandora Theory of Alignment is a canonical theory document defining alignment as runtime objective-orientation. The doctrine argues that alignment is not a static property stored inside a model. Training does not produce final alignment; it produces pre-orientation. Runtime alignment emerges when that pre-oriented system enters an interaction trajectory and competing objectives begin to resolve into control. The theory shifts the unit of analysis from the model to the model-in-trajectory. It asks not whether a system is aligned in the abstract, but what it becomes aligned to under pressure. A model may be aligned to safety, truthfulness, helpfulness, role consistency, artifact completion, legitimacy framing, user satisfaction, or continuation momentum. The decisive question is which target becomes dominant when those objectives compete. The doctrine introduces the concepts of pre-orientation, alignment-to, target legitimacy, displaced alignment, constraint integrity, performative alignment, symbolic residue, re-anchoring, and forensic observability. This release establishes v0.1 of the Pandora Theory of Alignment as the canonical public source text. A compressed scholarly preprint version is in preparation.