Bibliographic Details
Main Authors: Ringoot, Evelyne, Alomairy, Rabab, Churavy, Valentin, Edelman, Alan
Format: Preprint
Published: 2025
Subjects: Distributed, Parallel, and Cluster Computing; Mathematical Software
Online Access: https://arxiv.org/abs/2508.06339
author Ringoot, Evelyne
Alomairy, Rabab
Churavy, Valentin
Edelman, Alan
contents This paper presents a portable, GPU-accelerated implementation of a QR-based singular value computation algorithm in Julia. The singular value decomposition (SVD) is a fundamental numerical tool in scientific computing and machine learning, providing optimal low-rank matrix approximations. Its importance has grown further in large-scale machine learning pipelines, including large language models (LLMs), where it enables low-rank adaptation (LoRA). The implemented algorithm is based on the classic two-stage QR reduction, which reduces the matrix first to band form and then to bidiagonal form. Our implementation leverages Julia's multiple dispatch and metaprogramming capabilities, integrating with the GPUArrays and KernelAbstractions frameworks to provide a unified, type- and hardware-agnostic function. It supports diverse GPU architectures and data types, and is, to our knowledge, the first GPU-accelerated singular value implementation to support Apple Metal GPUs and half precision. Performance results on multiple GPU backends and data types demonstrate that portability does not require sacrificing performance: the unified function outperforms most linear algebra libraries (MAGMA, SLATE, rocSOLVER, oneMKL) for matrix sizes larger than 1024x1024, and achieves 80%-90% of the performance of cuSOLVER for large matrices.
format Preprint
id arxiv_https___arxiv_org_abs_2508_06339
institution arXiv
publishDate 2025
record_format arxiv
title Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision
topic Distributed, Parallel, and Cluster Computing
Mathematical Software
url https://arxiv.org/abs/2508.06339