Staff View: CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression :: Library Catalog

Bibliographic Details
Main Authors:	Gou, Dayin; Byun, Sanghyun; Malpeddi, Nilesh; De Micheli, Gabrielle; Vaste, Prathamesh; Song, Jacob; Chung, Woo Seong
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2510.12721

_version_	1866914092344672256
author	Gou, Dayin; Byun, Sanghyun; Malpeddi, Nilesh; De Micheli, Gabrielle; Vaste, Prathamesh; Song, Jacob; Chung, Woo Seong
author_facet	Gou, Dayin; Byun, Sanghyun; Malpeddi, Nilesh; De Micheli, Gabrielle; Vaste, Prathamesh; Song, Jacob; Chung, Woo Seong
contents	Large Language Models (LLMs) typically rely on a large number of parameters for token embedding, leading to substantial storage requirements and memory footprints. In particular, LLMs deployed on edge devices are memory-bound, and reducing the memory footprint by compressing the embedding layer not only frees up memory bandwidth but also speeds up inference. To address this, we introduce CARVQ, a novel post-training method that combines a Corrective Adaptor with group Residual Vector Quantization. CARVQ relies on the composition of both linear and non-linear maps and mimics the original model embedding, compressing it to an average of approximately 1.6 bits per parameter without requiring specialized hardware to support lower-bit storage. We test our method on pre-trained LLMs such as LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3.2-3B-Instruct, LLaMA-3.1-8B, Qwen2.5-7B, Qwen2.5-Math-7B and Phi-4, evaluating on common generative, discriminative, math and reasoning tasks. We show that, in most cases, CARVQ achieves a lower average bitwidth-per-parameter than scalar quantization while maintaining reasonable perplexity and accuracy. Our contributions include a novel compression technique that is compatible with state-of-the-art transformer quantization methods and can be seamlessly integrated into any hardware supporting 4-bit memory to reduce the model's memory footprint on memory-constrained devices. This work is a crucial step toward the efficient deployment of LLMs on edge devices.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_12721
institution	arXiv
publishDate	2025
record_format	arxiv
title	CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression
topic	Machine Learning
url	https://arxiv.org/abs/2510.12721
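
Note on the abstract: the record describes group Residual Vector Quantization only at a high level. The following is a minimal, generic sketch of residual vector quantization for a single group of parameters, not the CARVQ implementation. The Corrective Adaptor is omitted, the codebooks are hand-picked toy values rather than learned ones, and the names `rvq_encode`, `rvq_decode`, and `index_bits_per_param` are hypothetical. The bit count covers index storage only and ignores codebook and adaptor overhead, so it is not expected to reproduce the 1.6-bit figure reported in the abstract.

```python
import math


def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Encode vec as one index per stage; each stage quantizes the residual
    left over by the previous stages."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected codeword of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


def index_bits_per_param(group_size, stages, codebook_size):
    """Index storage per parameter: each group of `group_size` parameters
    stores `stages` indices of log2(codebook_size) bits each."""
    return stages * math.log2(codebook_size) / group_size


# Toy two-stage codebooks for groups of 2 parameters (illustrative values only).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]],      # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],  # residual stage
]

indices = rvq_encode([1.25, 1.0], codebooks)
reconstructed = rvq_decode(indices, codebooks)
```

With these toy codebooks, each group of 2 parameters is stored as two 2-bit indices, so `index_bits_per_param(2, 2, 4)` gives 2.0 bits per parameter. Larger groups or fewer stages lower that figure at the cost of reconstruction error.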

Similar Items