Bibliographic Details
Main Author: Davis, Ethan
Format: Preprint
Published: 2025
Subjects: Performance
Online Access: https://arxiv.org/abs/2509.04594
author Davis, Ethan
author_facet Davis, Ethan
contents Matrix multiplication is the foundation of much of the success of high-performance technologies such as deep learning, scientific simulations, and video graphics. High-level programming languages like Python and R rely on highly optimized low-level libraries, such as the Basic Linear Algebra Subprograms (BLAS), for core linear algebra operations like matrix multiplication. This paper compares the performance of five matrix multiplication algorithms using CuBLAS, CUDA, BLAS, OpenMP, and C++ Threads. We find statistical significance, with a p-value below 5e-12, to support the hypothesis that for square $N \times N$ matrices with $N$ at least 10,000, the ranking of these algorithms by performance, measured in floating point operations per second (FLOPS), is CuBLAS, CUDA, BLAS, OpenMP, and C++ Threads.
format Preprint
id arxiv_https___arxiv_org_abs_2509_04594
institution arXiv
publishDate 2025
record_format arxiv
title High Performance Matrix Multiplication
topic Performance
url https://arxiv.org/abs/2509.04594
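
The abstract in this record ranks implementations by FLOPS. As a rough illustration of how such a figure might be obtained for the C++ Threads case, here is a minimal sketch. It is not the author's code: the names `matmul_threads`, `flops`, and `measure_flops` are hypothetical, the row-block partitioning is one simple choice among many, and the operation count of $2N^3$ for an $N \times N$ product is a common convention assumed here, not something the record states.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: C = A * B for row-major N x N matrices.
// Rows of C are split into contiguous blocks, one block per thread.
void matmul_threads(const std::vector<double>& a, const std::vector<double>& b,
                    std::vector<double>& c, std::size_t n, unsigned num_threads) {
    num_threads = std::max(1u, num_threads);
    auto worker = [&](std::size_t row_begin, std::size_t row_end) {
        for (std::size_t i = row_begin; i < row_end; ++i) {
            for (std::size_t j = 0; j < n; ++j) c[i * n + j] = 0.0;
            // i-k-j loop order keeps accesses to b and c contiguous in memory.
            for (std::size_t k = 0; k < n; ++k) {
                const double aik = a[i * n + k];
                for (std::size_t j = 0; j < n; ++j) c[i * n + j] += aik * b[k * n + j];
            }
        }
    };
    std::vector<std::thread> pool;
    // Ceiling division so every row is assigned to exactly one thread.
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        pool.emplace_back(worker, begin, std::min(n, begin + chunk));
    }
    for (auto& t : pool) t.join();
}

// Assumed convention: an N x N product costs about 2 * N^3 floating point operations.
double flops(std::size_t n, double seconds) {
    const double nd = static_cast<double>(n);
    return 2.0 * nd * nd * nd / seconds;
}

// Hypothetical helper: times one multiplication and converts the elapsed time to FLOPS.
double measure_flops(std::size_t n, unsigned num_threads) {
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n);
    const auto start = std::chrono::steady_clock::now();
    matmul_threads(a, b, c, n, num_threads);
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return flops(n, elapsed.count());
}
```

Calling `measure_flops(n, std::thread::hardware_concurrency())` for several values of `n` would give one sample per size. A single timing is noisy, so a comparison like the one the abstract describes would need repeated runs per implementation before any statistical test.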