Staff View: Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs :: Library Catalog

Bibliographic Details
Main Authors:	Yuan, Kai; Bauinger, Christoph; Zhang, Xiangyi; Baehr, Pascal; Kirchhart, Matthias; Dabert, Darius; Tousnakhoff, Adrien; Boudier, Pierre; Paulitsch, Michael
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2403.17607

_version_	1866929289369223168
author	Yuan, Kai; Bauinger, Christoph; Zhang, Xiangyi; Baehr, Pascal; Kirchhart, Matthias; Dabert, Darius; Tousnakhoff, Adrien; Boudier, Pierre; Paulitsch, Michael
author_facet	Yuan, Kai; Bauinger, Christoph; Zhang, Xiangyi; Baehr, Pascal; Kirchhart, Matthias; Dabert, Darius; Tousnakhoff, Adrien; Boudier, Pierre; Paulitsch, Michael
contents	This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs), which targets and is optimized for the Intel Data Center GPU Max 1550. To increase performance, our implementation minimizes slow global memory accesses by maximizing data reuse within the general register file and the shared local memory, which it achieves by fusing the operations in each layer of the MLP. We show with a simple roofline model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference. We compare our approach to a similar CUDA implementation for MLPs and show that our implementation on the Intel Data Center GPU outperforms the CUDA implementation on Nvidia's H100 GPU by a factor of up to 2.84 in inference and 1.75 in training. The paper also showcases the efficiency of our SYCL implementation in three significant areas: Image Compression, Neural Radiance Fields, and Physics-Informed Machine Learning. In all cases, our implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by up to a factor of 30 and the CUDA PyTorch version on Nvidia's H100 GPU by up to a factor of 19. The code can be found at https://github.com/intel/tiny-dpcpp-nn.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_17607
institution	arXiv
publishDate	2024
record_format	arxiv
title	Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs
topic	Artificial Intelligence
url	https://arxiv.org/abs/2403.17607
