Affichage MARC: :: Library Catalog

Enregistré dans:

Détails bibliographiques
Auteur principal:	Sapunov, Grigory
Format:	Preprint
Publié:	2026
Sujets:	Machine Learning Artificial Intelligence Computation and Language I.2.6
Accès en ligne:	https://arxiv.org/abs/2604.21999
Tags:	Ajouter un tag Pas de tags, Soyez le premier à ajouter un tag!

_version_	1866909011199131648
author	Sapunov, Grigory
author_facet	Sapunov, Grigory
contents	We study learned memory tokens as a computational scratchpad for a single-block Universal Transformer with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. Memory tokens are empirically necessary: no configuration without them reaches non-trivial performance. The optimal count has a sharp lower threshold (T=0 always fails, T=8 reliably succeeds) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and a dilution boundary at T=64. Under halt-side pressure (lambda warmup), mean halt drops monotonically with memory size across the plateau (from 11.6 at T=8 to 8.3 at T=64), showing that memory tokens and ponder depth substitute as resources at fixed accuracy. We also identify a router initialization trap that causes the majority of training runs to fail: both default zero-bias and Graves' recommended positive bias settle into a shallow halt equilibrium the model cannot escape. Inverting the bias to -3 ("deep start") eliminates the failure mode, and ablation shows the trap is inherent to ACT initialization rather than an artifact of our architecture. With reliable training, ACT yields an order of magnitude lower seed variance than fixed-depth processing (+/-0.7 vs +/-9.3 pp); lambda warmup recovers 34% of compute at matched accuracy; and attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code: https://github.com/che-shr-cat/utm-jax.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_21999
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning Sapunov, Grigory Machine Learning Artificial Intelligence Computation and Language I.2.6 We study learned memory tokens as a computational scratchpad for a single-block Universal Transformer with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. Memory tokens are empirically necessary: no configuration without them reaches non-trivial performance. The optimal count has a sharp lower threshold (T=0 always fails, T=8 reliably succeeds) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and a dilution boundary at T=64. Under halt-side pressure (lambda warmup), mean halt drops monotonically with memory size across the plateau (from 11.6 at T=8 to 8.3 at T=64), showing that memory tokens and ponder depth substitute as resources at fixed accuracy. We also identify a router initialization trap that causes the majority of training runs to fail: both default zero-bias and Graves' recommended positive bias settle into a shallow halt equilibrium the model cannot escape. Inverting the bias to -3 ("deep start") eliminates the failure mode, and ablation shows the trap is inherent to ACT initialization rather than an artifact of our architecture. With reliable training, ACT yields an order of magnitude lower seed variance than fixed-depth processing (+/-0.7 vs +/-9.3 pp); lambda warmup recovers 34% of compute at matched accuracy; and attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code: https://github.com/che-shr-cat/utm-jax.
title	Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
topic	Machine Learning Artificial Intelligence Computation and Language I.2.6
url	https://arxiv.org/abs/2604.21999

Documents similaires