
Bibliographic Details
Main Authors:	Sadasivan, Harisankar; Ozturk, Muhammed Emin; Osama, Muhammad; Millette, Chris; Rai, Astha; Podkorytov, Maksim; Afaganis, John; Huang, Carlus; Zhang, Jing; Liu, Jun
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing; D.2; I.2
Online Access:	https://arxiv.org/abs/2408.11417
Abstract:

General matrix multiplication (GEMM) operations are fundamental building blocks of many computational domains, including artificial intelligence (AI). As GPU architectures evolve and high-performance AI grows in importance, optimizing GEMM performance has become a central problem. This paper introduces Stream-K++, an enhancement to the promising Stream-K GEMM scheduling algorithm for workload balancing. We expand Stream-K's scheduling policies from three to seven and implement an efficient solution selection mechanism using Bloom filters. Our approach rapidly eliminates up to 95.8% of unsuitable configurations while maintaining a 100% true-negative rate. Implemented using the AMD Composable Kernel library and evaluated on AMD Instinct MI250X GPUs, Stream-K++ demonstrates significant performance gains (up to 43%) in select scenarios. It remains competitive (within 20% of optimal) for 60% to 97.6% of problem sizes. Our flexible framework, implemented in the Open-sieve C++ library, allows for easy adaptation to new problem sizes, scheduling policies, or additional tuning parameters, paving the way for future optimizations in GPU-based GEMM operations.
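
To illustrate how Bloom filters can prune scheduling policies, here is a minimal C++ sketch. It is not the Open-sieve API and not the authors' implementation. It assumes one filter per scheduling policy, filled offline with the problem sizes for which that policy performed acceptably. The names `ProblemSize`, `BloomFilter`, `mix64`, and `candidate_policies`, as well as the hash scheme and filter dimensions, are hypothetical choices made for this example. The property it relies on is standard: a Bloom filter can report false positives but never false negatives, so a policy rejected by its filter was definitely not recorded for that problem size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical key: a GEMM problem size (an M x K matrix times a K x N matrix).
struct ProblemSize {
    std::uint32_t m, n, k;
};

// 64-bit mixing step in the style of splitmix64; any well-distributed hash would do.
inline std::uint64_t mix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

class BloomFilter {
public:
    BloomFilter(std::size_t num_bits, unsigned num_hashes)
        : bits_(num_bits, false), num_hashes_(num_hashes) {
        assert(num_bits > 0 && num_hashes > 0);
    }

    // Record a problem size by setting num_hashes_ bit positions.
    void insert(const ProblemSize& p) {
        for (unsigned i = 0; i < num_hashes_; ++i) bits_[index(p, i)] = true;
    }

    // false: the problem size was definitely never inserted.
    // true:  it was possibly inserted (false positives can occur).
    bool may_contain(const ProblemSize& p) const {
        for (unsigned i = 0; i < num_hashes_; ++i)
            if (!bits_[index(p, i)]) return false;
        return true;
    }

private:
    // Double hashing: position_i = (h1 + i * h2) mod num_bits.
    std::size_t index(const ProblemSize& p, unsigned i) const {
        std::uint64_t h1 = mix64((static_cast<std::uint64_t>(p.m) << 32) | p.n);
        h1 = mix64(h1 ^ p.k);
        std::uint64_t h2 = mix64(h1) | 1ULL;  // force an odd, non-zero stride
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    unsigned num_hashes_;
};

// Given one filter per scheduling policy, return the indices of the policies
// that survive pruning for problem size p. Only these need further evaluation.
std::vector<std::size_t> candidate_policies(const std::vector<BloomFilter>& filters,
                                            const ProblemSize& p) {
    std::vector<std::size_t> survivors;
    for (std::size_t i = 0; i < filters.size(); ++i)
        if (filters[i].may_contain(p)) survivors.push_back(i);
    return survivors;
}
```

In use, a caller would build seven filters (one per policy), insert tuned problem sizes into each, and call `candidate_policies` at dispatch time. A false positive only costs an extra candidate to evaluate, while a policy that was recorded for a problem size is never discarded.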