| Main Authors: | Koike-Akino, Toshiaki; Chen, Xiangyu; Liu, Jing; Wang, Ye; Wang, Pu; Brand, Matthew |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
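| Subjects: | Machine Learning; Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition |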
| Online Access: | https://arxiv.org/abs/2505.18413 |
| _version_ | 1866909621976825856 |
|---|---|
| author | Koike-Akino, Toshiaki; Chen, Xiangyu; Liu, Jing; Wang, Ye; Wang, Pu; Brand, Matthew |
| contents | Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require massive computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. When the latent dimension is reduced to realize computationally and memory-efficient LLMs/LMMs, our framework significantly improves model accuracy over existing model compression methods. We demonstrate the benefit on several benchmarks, including multi-modal reasoning tasks. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2505_18413 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | LatentLLM: Attention-Aware Joint Tensor Compression |
| topic | Machine Learning; Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2505.18413 |