Optimizing Winograd Convolution on ARMv8 processors

Bibliographic Details
Main Authors:	Gui, Haoyuan; Zhang, Xiaoyu; Zhang, Chong; Su, Zitong; Li, Huiyuan
Format:	Preprint
Published:	2024
Subjects:	Performance
Online Access:	https://arxiv.org/abs/2411.16152

_version_	1866917879500242944
author	Gui, Haoyuan; Zhang, Xiaoyu; Zhang, Chong; Su, Zitong; Li, Huiyuan
author_facet	Gui, Haoyuan; Zhang, Xiaoyu; Zhang, Chong; Su, Zitong; Li, Huiyuan
contents	As Convolutional Neural Networks (CNNs) gain prominence in deep learning, algorithms like Winograd Convolution have been introduced to enhance computational efficiency. However, existing implementations often face challenges such as high transformation overhead, suboptimal computation efficiency, and reduced parallel performance in some layers. We propose a fused Winograd Convolution algorithm optimized for ARMv8 CPUs, integrating input transformation, filter transformation, computation, and output transformation into a single pipeline. By maintaining consecutive memory access and using a custom z-shaped data layout, our approach fully utilizes an optimized GEMM micro-kernel with a ping-pong technique. Additionally, we introduce a multi-dimensional parallel strategy that adapts to convolutional layer scales. To maximize performance, we manually optimize each kernel in AArch64 assembly and carefully tune blocking parameters. Experimental results show speedups of up to 4.74x, 4.10x, 4.72x, and 10.57x over NCNN, NNPACK, FastConv, and ACL on the Kunpeng 920 platform using multiple threads, with respective gains of 3.85x, 2.81x, 4.20x, and 7.80x on the AWS Graviton2, and 3.32x, 3.68x, 8.00x, and 9.28x on the Phytium 2000+.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_16152
institution	arXiv
publishDate	2024
record_format	arxiv
title	Optimizing Winograd Convolution on ARMv8 processors
topic	Performance
url	https://arxiv.org/abs/2411.16152
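
Note on the abstract: the four stages it names (input transformation, filter transformation, computation, output transformation) can be illustrated with the textbook one-dimensional Winograd F(2,3) case, which produces two outputs of a 3-tap filter using four multiplications instead of six. The sketch below is only that generic illustration, assuming plain Python lists and hypothetical function names; it is not the fused ARMv8 pipeline, z-shaped data layout, or GEMM micro-kernel described in the preprint.

```python
# Textbook Winograd F(2,3): two outputs of a 3-tap filter from a 4-element tile.
# All function names here are illustrative, not taken from the preprint.

def filter_transform(g):
    """Transform a 3-tap filter into 4 values (can be precomputed once per filter)."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]

def input_transform(d):
    """Transform a 4-element input tile using additions and subtractions only."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]

def output_transform(m):
    """Combine the 4 element-wise products into the 2 output values."""
    m0, m1, m2, m3 = m
    return [m0 + m1 + m2, m1 - m2 - m3]

def winograd_f2_3(d, g):
    """Run the four stages in sequence; the 'computation' stage is 4 multiplies."""
    u = filter_transform(g)
    v = input_transform(d)
    m = [a * b for a, b in zip(u, v)]
    return output_transform(m)

def direct_conv(d, g):
    """Reference: direct sliding-window correlation, 6 multiplies for 2 outputs."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

if __name__ == "__main__":
    tile = [1.0, 2.0, 3.0, 4.0]
    taps = [1.0, 0.0, -1.0]
    print(winograd_f2_3(tile, taps), direct_conv(tile, taps))
```

In 2D convolution the same idea is applied along both dimensions of each tile, and the element-wise products across channels become the matrix multiplications that a GEMM kernel would handle.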
