MARC21: Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR :: Library Catalog

Bibliographic Details
Main Authors:	Haris, Jude; Agostini, Nicolas Bohm; Tumeo, Antonino; Kaeli, David; Cano, José
Format:	Preprint
Published:	2024
Subjects:	Programming Languages
Online Access:	https://arxiv.org/abs/2402.19184

_version_	1866917601021526016
author	Haris, Jude; Agostini, Nicolas Bohm; Tumeo, Antonino; Kaeli, David; Cano, José
author_facet	Haris, Jude; Agostini, Nicolas Bohm; Tumeo, Antonino; Kaeli, David; Cano, José
contents	As custom hardware accelerators become more prevalent, it becomes increasingly important to automatically generate efficient host-driver code that can fully leverage the capabilities of these accelerators. This approach saves time and reduces the likelihood of errors that can occur during manual implementation. AXI4MLIR extends the MLIR compiler framework to generate host-driver code for custom accelerators for linear algebra problems. By leveraging specific compiler optimizations, we can further increase accelerator utilization. In this work we offer two key observations through a MatMul accelerator case study. First, the accelerator's compute core utilization is less than 10%, and second, the critical latency bottleneck is caused by copying data between the heap and memory-mapped DMA buffers. We identify a set of missing host code optimizations to improve the under-utilization and the latency bottleneck. Therefore, we propose three key host-code data-movement-related optimizations, extending AXI4MLIR. The optimizations provide DMA-based data allocation, coalescing of DMA transfers, and pipelining of the accelerator's load, compute, and store stages.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_19184
institution	arXiv
publishDate	2024
record_format	arxiv
title	Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR
topic	Programming Languages
url	https://arxiv.org/abs/2402.19184
