| Main authors: | Chen, Zhuo; Comas, Oriol Mayné i; Jin, Zhuotao; Luo, Di; Soljačić, Marin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
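| Subjects: | Computation and Language; Artificial Intelligence; Information Theory; Machine Learning; Data Analysis, Statistics and Probability |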
| Online access: | https://arxiv.org/abs/2503.04725 |
| _version_ | 1866911228642721792 |
|---|---|
| author | Chen, Zhuo; Comas, Oriol Mayné i; Jin, Zhuotao; Luo, Di; Soljačić, Marin |
| contents | We present a universal theoretical framework for understanding long-context language modeling based on a bipartite mutual information scaling law that we rigorously verify in natural language. We demonstrate that bipartite mutual information captures multi-token interactions distinct from and scaling independently of conventional two-point mutual information, and show that this provides a more complete characterization of the dependencies needed for accurately modeling long sequences. Leveraging this scaling law, we formulate the Long-context Language Modeling (L$^2$M) condition, which lower bounds the necessary scaling of a model's history state -- the latent variables responsible for storing past information -- for effective long-context modeling. We validate the framework and its predictions on transformer and state-space models. Our work provides a principled foundation to understand long-context modeling and to design more efficient architectures with stronger long-context capabilities, with potential applications beyond natural language. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2503_04725 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling |
| topic | Computation and Language; Artificial Intelligence; Information Theory; Machine Learning; Data Analysis, Statistics and Probability |
| url | https://arxiv.org/abs/2503.04725 |