Ravindran, S. K. (2025). Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers.
Chicago style (17th ed.): Ravindran, Santhosh Kumar. Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers. 2025.
MLA style (9th ed.): Ravindran, Santhosh Kumar. Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers. 2025.
Note: These citations may not be 100% accurate.