Bibliographic Details
Main Authors: Koike-Akino, Toshiaki, Chen, Xiangyu, Liu, Jing, Wang, Ye, Wang, Pu, Brand, Matthew
Format: Preprint
Published: 2025
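Subjects: Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2505.18413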
author Koike-Akino, Toshiaki
Chen, Xiangyu
Liu, Jing
Wang, Ye
Wang, Pu
Brand, Matthew
contents Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require massive computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. Our framework can significantly improve model accuracy over existing model compression methods when reducing the latent dimension to realize computationally and memory-efficient LLMs/LMMs. We show the benefit on several benchmarks, including multi-modal reasoning tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2505_18413
institution arXiv
publishDate 2025
record_format arxiv
title LatentLLM: Attention-Aware Joint Tensor Compression
topic Machine Learning
Artificial Intelligence
Computation and Language
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2505.18413