

Bibliographic Details
Main authors:	Lee, Jungi; Lee, Wonbeom; Sim, Jaewoong
Format:	Preprint
Published:	2024
Subjects:	Machine Learning Hardware Architecture
Online access:	https://arxiv.org/abs/2406.12930

_version_	1866909227132387328
author	Lee, Jungi; Lee, Wonbeom; Sim, Jaewoong
author_facet	Lee, Jungi; Lee, Wonbeom; Sim, Jaewoong
contents	Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning and have thus become one of the most important workloads in today's computing landscape. However, deploying LLM inference poses challenges due to the high compute and memory requirements stemming from the enormous model size and the difficulty of running it in the integer pipelines. In this paper, we present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision. Based on our analysis of outlier values in LLMs, we propose a decomposed quantization technique in which the scale factors of decomposed matrices are powers of two apart. The proposed scheme allows us to avoid explicit requantization (i.e., dequantization/quantization) when accumulating the partial sums from the decomposed matrices, with a minimal extension to the commodity tensor compute hardware. Our evaluation shows that Tender achieves higher accuracy and inference performance compared to the state-of-the-art methods while also being significantly less intrusive to the existing accelerators.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_12930
institution	arXiv
publishDate	2024
record_format	arxiv
title	Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization
topic	Machine Learning Hardware Architecture
url	https://arxiv.org/abs/2406.12930
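
Illustrative note on the abstract: the sketch below shows, under stated assumptions, why scale factors that are powers of two apart make explicit requantization unnecessary when partial sums are accumulated. It is not the authors' implementation. The function names (quantize, decomposed_matmul), the rule for assigning channels to groups, the single per-tensor weight scale, and the coarsest-first accumulation order are all hypothetical choices made for this example; only the power-of-two spacing of the scales and the goal of skipping dequantization/quantization between partial sums come from the abstract.

```python
def quantize(value, scale, qmax=127):
    """Round value/scale to the nearest integer and clamp to [-qmax, qmax]."""
    return max(-qmax, min(qmax, round(value / scale)))


def decomposed_matmul(x, w, num_groups=4, qmax=127):
    """Multiply x (M x C) by w (C x N) using channel groups whose scales differ by 2x.

    Returns (shifted, explicit):
      shifted  - shift-and-accumulate in integers, then one final rescale
      explicit - every partial sum rescaled to float separately (reference)
    Assumes x and w each contain at least one nonzero value.
    """
    rows, chans, cols = len(x), len(w), len(w[0])
    ch_max = [max(abs(x[i][c]) for i in range(rows)) for c in range(chans)]

    # Group 0 has the coarsest scale (covers outliers); each next group halves it.
    top = max(ch_max) / qmax
    scales = [top / 2 ** g for g in range(num_groups)]

    # Hypothetical grouping rule: the finest group whose range still covers the channel.
    group = [max([g for g in range(num_groups) if m <= scales[g] * qmax] or [0])
             for m in ch_max]

    # Assumption: one per-tensor scale for the weights.
    w_scale = max(abs(v) for row in w for v in row) / qmax
    wq = [[quantize(v, w_scale, qmax) for v in row] for row in w]

    acc = [[0] * cols for _ in range(rows)]          # integer accumulator
    explicit = [[0.0] * cols for _ in range(rows)]   # float reference
    for g in range(num_groups):                      # coarsest scale first
        members = [c for c in range(chans) if group[c] == g]
        for i in range(rows):
            for j in range(cols):
                partial = sum(quantize(x[i][c], scales[g], qmax) * wq[c][j]
                              for c in members)
                # A 1-bit left shift realigns the running sum to the next,
                # 2x finer scale, so no dequantize/quantize step is needed.
                acc[i][j] = (acc[i][j] << 1) + partial
                explicit[i][j] += scales[g] * w_scale * partial

    # One rescale at the very end, using the finest group's scale.
    final = scales[-1] * w_scale
    shifted = [[final * v for v in row] for row in acc]
    return shifted, explicit
```

Calling decomposed_matmul(x, w) on an activation matrix with a few large-magnitude channels returns two results that agree up to floating-point rounding, which is the point of the example: the integer shift replaces the per-group rescale.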
