Bibliographic Details
Main Authors:	Ye, ChangMin; Sim, Yonguk; Kim, Youngchae; Jin, SeongMin; Jeong, Doo Seok
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2412.04778

author	Ye, ChangMin; Sim, Yonguk; Kim, Youngchae; Jin, SeongMin; Jeong, Doo Seok
contents	Transformer-based large language models are memory-bound: they operate on large amounts of data that are only marginally reused. Thus, data movement between the host and the accelerator likely dictates the total wall-clock time. Layer normalization is one of the key workloads in the transformer model, following each multi-head attention and feed-forward network block. To reduce data movement, layer normalization needs to be performed on the same chip as the matrix-matrix multiplication engine. To this end, we introduce an iterative L2-normalization method for 1D input (IterL2Norm) that converges to the steady-state solution within five iteration steps with high precision, outperforming the fast inverse square root algorithm in six out of nine cases for FP32 and five out of nine for BFloat16 across the embedding lengths used in the OPT models. Implemented in 32/28nm CMOS, the IterL2Norm macro normalizes $d$-dimensional vectors, where $64 \leq d \leq 1024$, with a latency of 116-227 cycles at 100MHz/1.05V.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_04778
institution	arXiv
publishDate	2024
record_format	arxiv
title	IterL2Norm: Fast Iterative L2-Normalization
topic	Machine Learning
url	https://arxiv.org/abs/2412.04778

Similar Items