Staff View: Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production :: Library Catalog

Bibliographic Details
Main Authors:	Galashov, Alexandre; Jones, Matt; Ke, Rosemary; Cao, Yuan; Nagarajan, Vaishnavh; Mozer, Michael C.
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.13879

_version_	1866913097795502080
author	Galashov, Alexandre; Jones, Matt; Ke, Rosemary; Cao, Yuan; Nagarajan, Vaishnavh; Mozer, Michael C.
author_facet	Galashov, Alexandre; Jones, Matt; Ke, Rosemary; Cao, Yuan; Nagarajan, Vaishnavh; Mozer, Michael C.
contents	Within the landscape of inference-time scaling methods for foundation models, a width-based approach to scaling -- which involves the insertion of <pause> tokens in the input stream to delay model responses -- offers a unique advantage by increasing model expressivity while remaining highly parallelizable at both training and inference. The existing literature on training models to utilize <pause> tokens relies on the standard cross-entropy objective in which the model output is read out and evaluated only at the final step of a pause sequence. This approach provides no mechanism for the model to regulate its own processing or to signal readiness to respond, treating the additional compute steps as a static barrier rather than a resource to be used adaptively. We propose a supervised loss, Catch Your Breath (CYB), framed as a sequential-decision problem, that trains a model to dynamically and autonomously scale the number of compute steps used for each input token. The model indicates the need for additional compute steps by emitting a special <don't know> output, delaying its response via a pause. The model can abstain multiple times to obtain longer delays. Our experiments demonstrate that CYB significantly outperforms standard cross-entropy when introduced either in pretraining or fine-tuning, reducing perplexity and enhancing downstream accuracy with no additional computational or memory cost.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_13879
institution	arXiv
publishDate	2025
record_format	arxiv
title	Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2510.13879

Similar Items