Bibliographic Details
Main Authors:	Zhang, Chengming; Ding, Xinheng; Sun, Baixi; Yu, Xiaodong; Zheng, Weijian; Xie, Zhen; Tao, Dingwen
Format:	Preprint
Published:	2024
Subjects:	Hardware Architecture; Machine Learning
Online Access:	https://arxiv.org/abs/2412.19829

_version_	1866917880103174144
author	Zhang, Chengming; Ding, Xinheng; Sun, Baixi; Yu, Xiaodong; Zheng, Weijian; Xie, Zhen; Tao, Dingwen
author_facet	Zhang, Chengming; Ding, Xinheng; Sun, Baixi; Yu, Xiaodong; Zheng, Weijian; Xie, Zhen; Tao, Dingwen
contents	Heterogeneous hardware such as the Gaudi processor has been developed to enhance computations, especially matrix operations, for Transformer-based large language models (LLMs) in generative AI tasks. However, our analysis indicates that Transformers are not fully optimized on such emerging hardware, primarily because of inadequate optimization of non-matrix computational kernels such as Softmax and of heterogeneous resource utilization, particularly when processing long sequences. To address these issues, we propose an integrated approach, called GFormer, that merges sparse and linear attention mechanisms. GFormer aims to maximize the computational capabilities of the Gaudi processor's Matrix Multiplication Engine (MME) and Tensor Processing Cores (TPC) without compromising model quality. GFormer includes a windowed self-attention kernel and an efficient outer product kernel for causal linear attention, both aimed at optimizing LLM inference on Gaudi processors. Evaluation shows that GFormer significantly improves efficiency and model performance across various tasks on the Gaudi processor and outperforms state-of-the-art GPUs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_19829
institution	arXiv
publishDate	2024
record_format	arxiv
title	GFormer: Accelerating Large Language Models with Optimized Transformers on Gaudi Processors
topic	Hardware Architecture; Machine Learning
url	https://arxiv.org/abs/2412.19829
