Staff View: Efficient Pretraining Length Scaling :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Bohong; Yan, Shen; Zhang, Sijun; Lu, Jianqiao; Zeng, Yutao; Wang, Ya; Zhou, Xun
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2504.14992

_version_	1866909590895984640
author	Wu, Bohong; Yan, Shen; Zhang, Sijun; Lu, Jianqiao; Zeng, Yutao; Wang, Ya; Zhou, Xun
author_facet	Wu, Bohong; Yan, Shen; Zhang, Sijun; Lu, Jianqiao; Zeng, Yutao; Wang, Ya; Zhou, Xun
contents	Recent advances in large language models have demonstrated the effectiveness of length scaling during post-training, yet its potential in pre-training remains underexplored. We present the Parallel Hidden Decoding Transformer (PHD-Transformer), a novel framework that enables efficient length scaling during pre-training while maintaining inference efficiency. PHD-Transformer achieves this through an innovative KV cache management strategy that distinguishes between original tokens and hidden decoding tokens. By retaining only the KV cache of original tokens for long-range dependencies while immediately discarding hidden decoding tokens after use, our approach maintains the same KV cache size as the vanilla transformer while enabling effective length scaling. To further enhance performance, we introduce two optimized variants: PHD-SWA employs sliding window attention to preserve local dependencies, while PHD-CSWA implements chunk-wise sliding window attention to eliminate linear growth in pre-filling time. Extensive experiments demonstrate consistent improvements across multiple benchmarks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_14992
institution	arXiv
publishDate	2025
record_format	arxiv
title	Efficient Pretraining Length Scaling
topic	Computation and Language
url	https://arxiv.org/abs/2504.14992

Similar Items