Staff View: Text Compression for Efficient Language Generation :: Library Catalog

Bibliographic Details
Main Authors:	Gu, David; Belcak, Peter; Wattenhofer, Roger
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2503.11426

_version_	1866909537467891712
author	Gu, David; Belcak, Peter; Wattenhofer, Roger
author_facet	Gu, David; Belcak, Peter; Wattenhofer, Roger
contents	We challenge the prevailing assumption that LLMs must rely fully on sub-word tokens for high-quality text generation. To this end, we propose the "Generative Pretrained Thoughtformer" (GPTHF), a hierarchical transformer language model capable of text generation by compressing text into sentence embeddings and employing a sentence attention mechanism. GPTHF retains GPT's architecture, modifying only token interactions via dynamic sparse attention masks. Our experiments show that GPTHF achieves up to an order-of-magnitude improvement in FLOPs efficiency and a threefold increase in runtime speed compared to equally sized GPT models in the low-size regime. This is achieved through a unique generation method that caches and reuses sentence embeddings, allowing significant portions of the input to bypass large parts of the network.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_11426
institution	arXiv
publishDate	2025
record_format	arxiv
title	Text Compression for Efficient Language Generation
topic	Computation and Language
url	https://arxiv.org/abs/2503.11426

Similar Items