Saved in:
Bibliographic Details
Main Author: Dillon, Joshua V.
Format: Preprint
Published: 2026
Subjects:
Online Access:https://arxiv.org/abs/2601.19085
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866912859023212544
author Dillon, Joshua V.
author_facet Dillon, Joshua V.
contents Biological neural systems must be fast but are energy-constrained. Evolution's solution: act on the first signal. Winner-take-all circuits and time-to-first-spike coding implicitly treat when a neuron fires as an expression of confidence. We apply this principle to ensembles of Tiny Recursive Models (TRM) [Jolicoeur-Martineau et al., 2025]. On Sudoku-Extreme, halt-first selection achieves 97% accuracy vs. 91% for probability averaging -- while requiring 10x fewer reasoning steps. A single baseline model achieves 85.5% +/- 1.3%. Can we internalize this as a training-only cost? Yes: by maintaining K=4 parallel latent states but backpropping only through the lowest-loss "winner," we achieve 96.9% +/- 0.6% accuracy -- matching ensemble performance at 1x inference cost, with less than half the variance of the baseline. A key diagnostic: 89% of baseline failures are selection problems, revealing a 99% accuracy ceiling. As in nature, this work was also resource constrained: all experiments used a single RTX 5090. A modified SwiGLU [Shazeer, 2020] made Muon [Jordan et al., 2024] and high LR viable, enabling baseline training in 48 minutes and full WTA (K=4) in 6 hours on consumer hardware.
format Preprint
id arxiv_https___arxiv_org_abs_2601_19085
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Speed is Confidence
Dillon, Joshua V.
Machine Learning
Biological neural systems must be fast but are energy-constrained. Evolution's solution: act on the first signal. Winner-take-all circuits and time-to-first-spike coding implicitly treat when a neuron fires as an expression of confidence. We apply this principle to ensembles of Tiny Recursive Models (TRM) [Jolicoeur-Martineau et al., 2025]. On Sudoku-Extreme, halt-first selection achieves 97% accuracy vs. 91% for probability averaging -- while requiring 10x fewer reasoning steps. A single baseline model achieves 85.5% +/- 1.3%. Can we internalize this as a training-only cost? Yes: by maintaining K=4 parallel latent states but backpropping only through the lowest-loss "winner," we achieve 96.9% +/- 0.6% accuracy -- matching ensemble performance at 1x inference cost, with less than half the variance of the baseline. A key diagnostic: 89% of baseline failures are selection problems, revealing a 99% accuracy ceiling. As in nature, this work was also resource constrained: all experiments used a single RTX 5090. A modified SwiGLU [Shazeer, 2020] made Muon [Jordan et al., 2024] and high LR viable, enabling baseline training in 48 minutes and full WTA (K=4) in 6 hours on consumer hardware.
title Speed is Confidence
topic Machine Learning
url https://arxiv.org/abs/2601.19085