MARC21 :: Library Catalog

Bibliographic Details
Main authors:	Fornasiere, Damiano; Bronzi, Mirko; Kitts, Spencer; Palmas, Alessandro; Bengio, Yoshua; Richardson, Oliver
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online access:	https://arxiv.org/abs/2604.17465

_version_	1866909006314864640
author	Fornasiere, Damiano; Bronzi, Mirko; Kitts, Spencer; Palmas, Alessandro; Bengio, Yoshua; Richardson, Oliver
author_facet	Fornasiere, Damiano; Bronzi, Mirko; Kitts, Spencer; Palmas, Alessandro; Bengio, Yoshua; Richardson, Oliver
contents	We provide evidence that language models can detect, localize and, to a certain degree, verbalize the difference between perturbations applied to their activations. More precisely, we either (a) mask activations, simulating dropout, or (b) add Gaussian noise to them, at a target sentence. We then ask a multiple-choice question such as "Which of the previous sentences was perturbed?" or "Which of the two perturbations was applied?". We test models from the Llama, Olmo, and Qwen families, with sizes between 8B and 32B, all of which can easily detect and localize the perturbations, often with perfect accuracy. These models can also learn, when taught in context, to distinguish between dropout and Gaussian noise. Notably, Qwen3-32B's zero-shot accuracy in identifying which perturbation was applied improves as a function of the perturbation strength and, moreover, decreases if the in-context labels are flipped, suggesting a prior for the correct ones -- even modulo controls. Because dropout has been used as a training-regularization technique, while Gaussian noise is sometimes added during inference, we discuss the possibility of a data-agnostic "training awareness" signal and the implications for AI safety.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_17465
institution	arXiv
publishDate	2026
record_format	arxiv
title	Language models recognize dropout and Gaussian noise applied to their activations
topic	Artificial Intelligence
url	https://arxiv.org/abs/2604.17465
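
The contents field above describes two perturbations applied to activations at a target sentence: masking (simulating dropout) and additive Gaussian noise. The following is a minimal illustrative sketch of those two operations, not the authors' implementation. It assumes activations are stored as a list of per-token vectors, that the target sentence is the token range [start, end), and that masked entries are zeroed without rescaling; the names `dropout_mask`, `gaussian_noise`, and `perturb_span` are hypothetical.

```python
import random


def dropout_mask(vector, p, rng):
    """Zero each entry independently with probability p.

    Assumption: no 1/(1-p) rescaling of the surviving entries.
    """
    return [0.0 if rng.random() < p else x for x in vector]


def gaussian_noise(vector, sigma, rng):
    """Add independent zero-mean Gaussian noise (std sigma) to each entry."""
    return [x + rng.gauss(0.0, sigma) for x in vector]


def perturb_span(activations, start, end, kind, strength, rng):
    """Return a copy of activations with token rows [start, end) perturbed.

    activations: list of per-token vectors (hypothetical layout).
    kind: "dropout" (strength = drop probability) or
          "gaussian" (strength = noise standard deviation).
    """
    perturbers = {"dropout": dropout_mask, "gaussian": gaussian_noise}
    if kind not in perturbers:
        raise ValueError(f"unknown perturbation kind: {kind!r}")
    perturb = perturbers[kind]
    return [
        perturb(row, strength, rng) if start <= i < end else list(row)
        for i, row in enumerate(activations)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    acts = [[1.0] * 4 for _ in range(6)]  # 6 tokens, hidden size 4
    masked = perturb_span(acts, 2, 4, "dropout", 0.5, rng)
    noisy = perturb_span(acts, 2, 4, "gaussian", 0.1, rng)
    print(masked[2:4])
    print(noisy[2:4])
```

In this sketch the `strength` argument plays the role of the perturbation strength mentioned in the abstract; which layers are perturbed and how strength is parameterized in the actual experiments is not stated in this record.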