Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Zhuo; Comas, Oriol Mayné i; Jin, Zhuotao; Luo, Di; Soljačić, Marin
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence; Information Theory; Machine Learning; Data Analysis, Statistics and Probability
Online Access:	https://arxiv.org/abs/2503.04725

_version_	1866911228642721792
author	Chen, Zhuo; Comas, Oriol Mayné i; Jin, Zhuotao; Luo, Di; Soljačić, Marin
author_facet	Chen, Zhuo; Comas, Oriol Mayné i; Jin, Zhuotao; Luo, Di; Soljačić, Marin
contents	We present a universal theoretical framework for understanding long-context language modeling based on a bipartite mutual information scaling law that we rigorously verify in natural language. We demonstrate that bipartite mutual information captures multi-token interactions distinct from and scaling independently of conventional two-point mutual information, and show that this provides a more complete characterization of the dependencies needed for accurately modeling long sequences. Leveraging this scaling law, we formulate the Long-context Language Modeling (L$^2$M) condition, which lower bounds the necessary scaling of a model's history state -- the latent variables responsible for storing past information -- for effective long-context modeling. We validate the framework and its predictions on transformer and state-space models. Our work provides a principled foundation to understand long-context modeling and to design more efficient architectures with stronger long-context capabilities, with potential applications beyond natural language.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_04725
institution	arXiv
publishDate	2025
record_format	arxiv
title	L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling
topic	Computation and Language; Artificial Intelligence; Information Theory; Machine Learning; Data Analysis, Statistics and Probability
url	https://arxiv.org/abs/2503.04725
