| Main Authors: | Galashov, Alexandre; Jones, Matt; Ke, Rosemary; Cao, Yuan; Nagarajan, Vaishnavh; Mozer, Michael C. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computation and Language; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2510.13879 |
| author | Galashov, Alexandre; Jones, Matt; Ke, Rosemary; Cao, Yuan; Nagarajan, Vaishnavh; Mozer, Michael C. |
| contents | Within the landscape of inference-time scaling methods for foundation models, a width-based approach to scaling -- which involves the insertion of <pause> tokens in the input stream to delay model responses -- offers a unique advantage by increasing model expressivity while remaining highly parallelizable at both training and inference. The existing literature on training models to utilize <pause> tokens relies on the standard cross-entropy objective in which the model output is read out and evaluated only at the final step of a pause sequence. This approach provides no mechanism for the model to regulate its own processing or to signal readiness to respond, treating the additional compute steps as a static barrier rather than a resource to be used adaptively. We propose a supervised loss, Catch Your Breath (CYB), framed as a sequential-decision problem, that trains a model to dynamically and autonomously scale the number of compute steps used for each input token. The model indicates the need for additional compute steps by emitting a special <don't know> output, delaying its response via a pause. The model can abstain multiple times to obtain longer delays. Our experiments demonstrate that CYB significantly outperforms standard cross-entropy when introduced either in pretraining or fine-tuning, reducing perplexity and enhancing downstream accuracy with no additional computational or memory cost. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2510_13879 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production |
| topic | Computation and Language Artificial Intelligence |
| url | https://arxiv.org/abs/2510.13879 |
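
The pause-and-abstain behavior described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `self_paced_next_token`, `toy_predict`, the `max_pauses` budget, and the choice to return `<don't know>` once the budget is exhausted are all assumptions made for illustration. The CYB training loss itself is not shown.

```python
from typing import Callable, List, Tuple

PAUSE = "<pause>"            # token inserted in the input stream to delay the response
DONT_KNOW = "<don't know>"   # special output by which the model asks for more compute


def self_paced_next_token(
    predict: Callable[[List[str]], str],
    context: List[str],
    max_pauses: int = 3,      # hypothetical cap on the number of abstentions
) -> Tuple[str, int]:
    """Query `predict` until it commits to an output or the pause budget runs out.

    Returns the final output and the number of pauses that were inserted.
    """
    inputs = list(context)    # copy so the caller's context is not modified
    pauses_used = 0
    while True:
        output = predict(inputs)
        # Stop when the model answers, or when no further pauses are allowed.
        if output != DONT_KNOW or pauses_used == max_pauses:
            return output, pauses_used
        inputs.append(PAUSE)  # abstention: grant one more compute step
        pauses_used += 1


def toy_predict(inputs: List[str]) -> str:
    """Hypothetical stand-in for a trained model.

    Contexts containing "hard" need two trailing pauses before it answers.
    """
    needed = 2 if "hard" in inputs else 0
    trailing = 0
    for token in reversed(inputs):
        if token != PAUSE:
            break
        trailing += 1
    return "answer" if trailing >= needed else DONT_KNOW


if __name__ == "__main__":
    print(self_paced_next_token(toy_predict, ["easy"]))
    print(self_paced_next_token(toy_predict, ["hard"]))
    print(self_paced_next_token(toy_predict, ["hard"], max_pauses=1))
```

In this sketch an easy context is answered immediately, while a hard context triggers repeated abstentions, each of which appends one `<pause>` token before the model is queried again. What should happen when the budget is exhausted is left to the caller here; the record does not say how the paper handles that case.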