Bibliographic Details
Main Authors: Liu, Shiwei, Tao, Guanchen, Zou, Yifei, Chow, Derek, Fan, Zichen, Lei, Kauna, Pan, Bangfei, Sylvester, Dennis, Kielian, Gregory, Saligane, Mehdi
Format: Preprint
Published: 2024
Subjects: Hardware Architecture, Artificial Intelligence, Machine Learning
Online Access: https://arxiv.org/abs/2402.10930
_version_ 1866912119797055488
author Liu, Shiwei
Tao, Guanchen
Zou, Yifei
Chow, Derek
Fan, Zichen
Lei, Kauna
Pan, Bangfei
Sylvester, Dennis
Kielian, Gregory
Saligane, Mehdi
contents The self-attention mechanism sets transformer-based large language models (LLMs) apart from convolutional and recurrent neural networks. Despite the resulting performance improvement, achieving real-time LLM inference on silicon remains challenging because self-attention relies heavily on Softmax. Beyond its non-linearity, Softmax has low arithmetic intensity, which significantly limits processing parallelism, especially with longer contexts. To address this challenge, we propose Constant Softmax (ConSmax), a software-hardware co-design that serves as an efficient alternative to Softmax. ConSmax uses differentiable normalization parameters to eliminate the maximum search and the denominator summation in Softmax. This approach enables extensive parallelization while still performing the essential functions of Softmax. Moreover, a scalable ConSmax hardware design with a bitwidth-split look-up table (LUT) can achieve lossless non-linear operations and support mixed-precision computing. Experimental results show that ConSmax consumes 0.2 mW of power and occupies an area of 0.0008 mm^2 at a 1250 MHz working frequency in 16 nm FinFET technology. As an open-source contribution, we further implement our design with the OpenROAD toolchain in SkyWater's 130 nm CMOS technology, where the corresponding power is 2.69 mW and the area is 0.007 mm^2. ConSmax achieves 3.35x power savings and 2.75x area savings in 16 nm technology, and 3.15x power savings and 4.14x area savings with the open-source EDA toolchain. At the same time, it maintains comparable accuracy with the GPT-2 model on the WikiText103 dataset. The project is available at https://github.com/ReaLLMASIC/ConSmax
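The abstract describes replacing the maximum search and the denominator summation with learnable normalization parameters but does not give a formula. The sketch below is a minimal illustration under a stated assumption: that ConSmax takes the form exp(s - beta) / gamma, with scalar parameters named beta and gamma here for illustration. The parameter values are hypothetical and are not taken from the paper or the project repository.

```python
import math

def softmax(scores):
    """Standard Softmax: two reductions over the whole row are required."""
    m = max(scores)                            # reduction 1: maximum search
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)                          # reduction 2: denominator summation
    return [e / total for e in exps]

def consmax(scores, beta, gamma):
    """Assumed ConSmax form: exp(s - beta) / gamma.

    beta and gamma are stand-ins for the learned normalization
    parameters. Each output depends only on its own score, so every
    element can be computed independently and in parallel.
    """
    return [math.exp(s - beta) / gamma for s in scores]

scores = [1.0, 2.0, 3.0]

# If beta and gamma happened to equal this row's maximum and denominator,
# the assumed form would reproduce Softmax exactly.
beta = max(scores)
gamma = sum(math.exp(s - beta) for s in scores)
matched = consmax(scores, beta, gamma)

# With fixed (hypothetical) constants, no row-wide reduction is needed,
# and the outputs are no longer guaranteed to sum to exactly 1.
fixed = consmax(scores, beta=2.0, gamma=4.0)

print(softmax(scores))
print(matched)
print(fixed, sum(fixed))
```

The point of the comparison is structural: softmax cannot emit any output until it has seen the whole row, while consmax under this assumption is a purely element-wise operation once the constants are fixed.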
format Preprint
id arxiv_https___arxiv_org_abs_2402_10930
institution arXiv
publishDate 2024
record_format arxiv
title ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters
topic Hardware Architecture
Artificial Intelligence
Machine Learning
url https://arxiv.org/abs/2402.10930