Staff View: Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem :: Library Catalog

Bibliographic Details
Main Authors:	Uchino, Yuki; Ma, Qianxiang; Imamura, Toshiyuki; Ozaki, Katsuhisa; Gutsche, Patrick Lars
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2512.08321

_version_	1866909966009368576
author	Uchino, Yuki; Ma, Qianxiang; Imamura, Toshiyuki; Ozaki, Katsuhisa; Gutsche, Patrick Lars
author_facet	Uchino, Yuki; Ma, Qianxiang; Imamura, Toshiyuki; Ozaki, Katsuhisa; Gutsche, Patrick Lars
contents	Modern computing architectures feature low-precision matrix multiplication units that achieve substantially higher throughput than their high-precision counterparts. Motivated by this architectural trend, the emulation of high-precision matrix multiplication using low-precision hardware has attracted significant interest in the high-performance computing community. Ozaki, Uchino, and Imamura proposed the Ozaki-II scheme as a general framework for emulating matrix multiplication. Building on this framework, Uchino, Ozaki, and Imamura developed high-performance and power-efficient techniques for emulating single- and double-precision real matrix multiplication on INT8 matrix engines. Extending this line of research, the present study proposes high-performance emulation methods for single- and double-precision complex matrix multiplication on INT8 matrix engines, based on the Ozaki-II scheme. On an NVIDIA B200 GPU, the proposed methods achieve 4.4--6.5x and 4.0--5.6x speedups over the native single- and double-precision complex matrix multiplication routines from cuBLAS, respectively, for sufficiently large problem sizes. When lower accuracy than that of the standard routines is acceptable, the proposed methods can operate at even higher speed. Conversely, with only a modest increase in computation time, they can deliver higher accuracy than that of the standard routines. These properties suggest that the proposed approach has the potential to serve as a default algorithm across a wide range of applications.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_08321
institution	arXiv
publishDate	2025
record_format	arxiv
title	Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem
topic	Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2512.08321
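
Illustrative note: the abstract describes emulating complex matrix multiplication on INT8 matrix engines, and the title names the Chinese Remainder Theorem (CRT). The sketch below is not the authors' implementation. It is a minimal NumPy illustration of the general idea, assuming the inputs are already small integer matrices (the conversion from floating point to integers is omitted), a hypothetical set of pairwise coprime moduli, and the plain four-product formula for the complex product. The function names `symmetric_residue`, `crt_matmul`, and `complex_crt_matmul` are invented for this example.

```python
import numpy as np
from math import prod

# Hypothetical pairwise coprime moduli whose symmetric residues fit in int8.
MODULI = (251, 253, 255, 256)


def symmetric_residue(x, m):
    """Reduce integers modulo m into the symmetric range and store as int8."""
    half = m // 2
    return ((x + half) % m - half).astype(np.int8)


def crt_matmul(a, b, moduli=MODULI):
    """Exact integer product a @ b, rebuilt from int8 products via the CRT.

    The result is exact only if every entry of a @ b lies strictly inside
    (-M/2, M/2), where M is the product of the moduli.
    """
    big_m = prod(moduli)
    acc = np.zeros((a.shape[0], b.shape[1]), dtype=np.int64)
    for m in moduli:
        a8 = symmetric_residue(a, m)
        b8 = symmetric_residue(b, m)
        # Mimic an INT8 engine: int8 inputs, int32 accumulation.
        c = (a8.astype(np.int32) @ b8.astype(np.int32)) % m
        # CRT weight: congruent to 1 modulo m and to 0 modulo the other moduli.
        weight = (big_m // m) * pow(big_m // m, -1, m) % big_m
        acc = (acc + c.astype(np.int64) * weight) % big_m
    # Map [0, M) back to the symmetric range to recover negative entries.
    return np.where(acc > big_m // 2, acc - big_m, acc)


def complex_crt_matmul(ar, ai, br, bi):
    """Real and imaginary parts of (ar + i*ai) @ (br + i*bi)."""
    cr = crt_matmul(ar, br) - crt_matmul(ai, bi)
    ci = crt_matmul(ar, bi) + crt_matmul(ai, br)
    return cr, ci
```

Using more moduli enlarges the range of values that can be reconstructed exactly, at the cost of one extra low-precision product per modulus. That is one way to picture the accuracy and speed trade-off mentioned in the abstract, although the paper's actual parameter choices are not given in this record.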