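Staff View: Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision :: Library Catalog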

Bibliographic Details
Main Authors:	Ringoot, Evelyne; Alomairy, Rabab; Churavy, Valentin; Edelman, Alan
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Mathematical Software
Online Access:	https://arxiv.org/abs/2508.06339

_version_	1866916888028643328
author	Ringoot, Evelyne; Alomairy, Rabab; Churavy, Valentin; Edelman, Alan
author_facet	Ringoot, Evelyne; Alomairy, Rabab; Churavy, Valentin; Edelman, Alan
contents	This paper presents a portable, GPU-accelerated implementation of a QR-based singular value computation algorithm in Julia. The singular value decomposition (SVD) is a fundamental numerical tool in scientific computing and machine learning, providing optimal low-rank matrix approximations. Its importance has increased even more in large-scale machine learning pipelines, including large language models (LLMs), where it enables low-rank adaptation (LoRA). The implemented algorithm is based on the classic two-stage QR reduction, consisting of successive matrix reduction to band form and bidiagonal form. Our implementation leverages Julia's multiple dispatch and metaprogramming capabilities, integrating with the GPUArrays and KernelAbstractions frameworks to provide a unified type and hardware-agnostic function. It supports diverse GPU architectures and data types, and is, to our knowledge, the first GPU-accelerated singular value implementation to support Apple Metal GPUs and half precision. Performance results on multiple GPU backends and data types demonstrate that portability does not require sacrificing performance: the unified function outperforms most linear algebra libraries (MAGMA, SLATE, rocSOLVER, oneMKL) for matrix sizes larger than 1024x1024, and achieves 80%-90% of the performance of cuSOLVER for large matrices.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_06339
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision Ringoot, Evelyne Alomairy, Rabab Churavy, Valentin Edelman, Alan Distributed, Parallel, and Cluster Computing Mathematical Software This paper presents a portable, GPU-accelerated implementation of a QR-based singular value computation algorithm in Julia. The singular value decomposition (SVD) is a fundamental numerical tool in scientific computing and machine learning, providing optimal low-rank matrix approximations. Its importance has increased even more in large-scale machine learning pipelines, including large language models (LLMs), where it enables low-rank adaptation (LoRA). The implemented algorithm is based on the classic two-stage QR reduction, consisting of successive matrix reduction to band form and bidiagonal form. Our implementation leverages Julia's multiple dispatch and metaprogramming capabilities, integrating with the GPUArrays and KernelAbstractions frameworks to provide a unified type and hardware-agnostic function. It supports diverse GPU architectures and data types, and is, to our knowledge, the first GPU-accelerated singular value implementation to support Apple Metal GPUs and half precision. Performance results on multiple GPU backends and data types demonstrate that portability does not require sacrificing performance: the unified function outperforms most linear algebra libraries (MAGMA, SLATE, rocSOLVER, oneMKL) for matrix sizes larger than 1024x1024, and achieves 80%-90% of the performance of cuSOLVER for large matrices.
title	Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision
topic	Distributed, Parallel, and Cluster Computing; Mathematical Software
url	https://arxiv.org/abs/2508.06339
