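Staff View: FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference :: Library Catalog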

Bibliographic Details
Main Authors:	Liu, Zirui; Song, Qingquan; Xiao, Qiang Charles; Selvaraj, Sathiya Keerthi; Mazumder, Rahul; Gupta, Aman; Hu, Xia
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2401.04044

_version_	1866911750593445888
author	Liu, Zirui; Song, Qingquan; Xiao, Qiang Charles; Selvaraj, Sathiya Keerthi; Mazumder, Rahul; Gupta, Aman; Hu, Xia
author_facet	Liu, Zirui; Song, Qingquan; Xiao, Qiang Charles; Selvaraj, Sathiya Keerthi; Mazumder, Rahul; Gupta, Aman; Hu, Xia
contents	The large number of parameters in Pretrained Language Models enhances their performance but also makes them resource-intensive, so they are challenging to deploy on commodity hardware such as a single GPU. Due to the memory and power limitations of these devices, model compression techniques are often used to decrease both the model's size and its inference latency. This usually results in a trade-off between model accuracy and efficiency. Therefore, optimizing this balance is essential for effectively deploying LLMs on commodity hardware. A significant portion of the efficiency challenge stems from the feed-forward network (FFN) component, which accounts for roughly $\frac{2}{3}$ of total parameters and inference latency. In this paper, we first observe that only a few neurons in the FFN module, which we call heavy hitters, have a large output norm for any input token, while the others are sparsely triggered by different tokens. Based on this observation, we explicitly split the FFN into two parts according to the heavy hitters. We improve the efficiency-accuracy trade-off of existing compression methods by allocating more resources to the FFN parts with heavy hitters. In practice, our method can reduce model size by 43.1\% and bring a $1.25\sim1.56\times$ wall-clock speedup on different hardware with a negligible accuracy drop.
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_04044
institution	arXiv
publishDate	2024
record_format	arxiv
title	FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference
topic	Computation and Language
url	https://arxiv.org/abs/2401.04044

Similar Items