Staff View: Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM

Bibliographic Details
Main Authors:	Carrica, Vicki; Onyango, Maxwell; Alomairy, Rabab; Ringoot, Evelyne; Schloss, James; Edelman, Alan
Format:	Preprint
Published:	2025
Subjects:	Mathematical Software; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2504.13821

_version_	1866916696363630592
author	Carrica, Vicki; Onyango, Maxwell; Alomairy, Rabab; Ringoot, Evelyne; Schloss, James; Edelman, Alan
author_facet	Carrica, Vicki; Onyango, Maxwell; Alomairy, Rabab; Ringoot, Evelyne; Schloss, James; Edelman, Alan
contents	This paper presents a performant and portable recursive implementation of triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) in Julia for GPUs, two kernels that underlie many linear-algebra algorithms. We restructure TRMM and TRSM so that most work is executed as general matrix-matrix multiplication (GEMM), improving use of the GPU memory hierarchy and reducing latency. Exploiting Julia's multiple dispatch and metaprogramming together with the GPUArrays and KernelAbstractions frameworks, we expose a single hardware-agnostic API that runs on NVIDIA, AMD, and Apple Silicon GPUs. For large matrices the recursive code reaches throughput comparable to vendor libraries such as cuBLAS and rocBLAS, while providing these routines on Apple Silicon for the first time. The entire implementation is only a few hundred lines of code, showing that unified Julia programs can deliver near-vendor performance across heterogeneous architectures.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_13821
institution	arXiv
publishDate	2025
record_format	arxiv
title	Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM
topic	Mathematical Software; Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2504.13821

Similar Items