Staff View: High Performance Matrix Multiplication :: Library Catalog

Bibliographic Details
Main Author:	Davis, Ethan
Format:	Preprint
Published:	2025
Subjects:	Performance
Online Access:	https://arxiv.org/abs/2509.04594

_version_	1866915480153882624
author	Davis, Ethan
author_facet	Davis, Ethan
contents	Matrix multiplication is the foundation of much of the success of high-performance technologies like deep learning, scientific simulations, and video graphics. High-level programming languages like Python and R rely on highly optimized low-level libraries, such as the Basic Linear Algebra Subprograms (BLAS), to perform core linear algebra operations like matrix multiplication. This paper compares the performance of five different matrix multiplication algorithms using CuBLAS, CUDA, BLAS, OpenMP, and C++ Threads. We find statistical significance, with a p-value below 5e-12, to support the hypothesis that for square $N \times N$ matrices where $N$ is at least 10,000, the ordering of these matrix multiplication algorithms by performance, measured in floating point operations per second (FLOPS), is CuBLAS, CUDA, BLAS, OpenMP, and C++ Threads.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_04594
institution	arXiv
publishDate	2025
record_format	arxiv
title	High Performance Matrix Multiplication
topic	Performance
url	https://arxiv.org/abs/2509.04594
