Bibliographic Details
Main Authors: Sadasivan, Harisankar, Ozturk, Muhammed Emin, Osama, Muhammad, Millette, Chris, Rai, Astha, Podkorytov, Maksim, Afaganis, John, Huang, Carlus, Zhang, Jing, Liu, Jun
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2408.11417
Table of Contents:
  • General matrix multiplication (GEMM) operations are fundamental building blocks of many computational domains, including artificial intelligence (AI). As GPU architectures evolve and high-performance AI becomes increasingly important, optimizing GEMM performance has become a central problem. This paper introduces Stream-K++, an enhancement to the promising Stream-K GEMM scheduling algorithm for workload balancing. We expand Stream-K's scheduling policies from three to seven and implement an efficient solution selection mechanism using Bloom filters. Our approach rapidly eliminates up to 95.8% of unsuitable configurations while maintaining a 100% true-negative rate. Implemented using the AMD Composable Kernel library and evaluated on AMD Instinct MI250X GPUs, Stream-K++ demonstrates significant performance gains (up to 43%) in select scenarios. It remains competitive (within 20% of optimal) for 60% to 97.6% of problem sizes. Our flexible framework, implemented in the Open-sieve C++ library, allows for easy adaptation to new problem sizes, scheduling policies, or additional tuning parameters, paving the way for future optimizations in GPU-based GEMM operations.
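The Bloom-filter selection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the Open-sieve API or the authors' implementation: the names `ConfigKey`, `BloomFilter`, `insert`, and `maybe_contains`, the key layout (GEMM sizes plus a policy id), and the hashing scheme are all assumptions made for illustration. The sketch shows only the general property that such a filter relies on: a negative answer is always correct, while a positive answer means "possibly present" and still needs checking.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical key: a GEMM problem size paired with a scheduling-policy id.
// The field layout is an assumption, not taken from the paper.
struct ConfigKey {
    std::uint32_t m, n, k;
    std::uint32_t policy;  // illustrative id, e.g. 0..6 for seven policies
};

class BloomFilter {
public:
    // num_bits and num_hashes must both be greater than zero.
    BloomFilter(std::size_t num_bits, std::size_t num_hashes)
        : bits_(num_bits, false), num_hashes_(num_hashes) {}

    // Record a configuration, e.g. one judged suitable during offline tuning.
    void insert(const ConfigKey& key) {
        for (std::size_t i = 0; i < num_hashes_; ++i) {
            bits_[index(key, i)] = true;
        }
    }

    // false: the key was definitely never inserted (safe to eliminate).
    // true:  the key may have been inserted (false positives are possible).
    bool maybe_contains(const ConfigKey& key) const {
        for (std::size_t i = 0; i < num_hashes_; ++i) {
            if (!bits_[index(key, i)]) {
                return false;
            }
        }
        return true;
    }

private:
    // splitmix64-style 64-bit mixer, used here as a simple hash.
    static std::uint64_t mix(std::uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    // Double hashing: the i-th probe position is (h1 + i * h2) mod num_bits.
    std::size_t index(const ConfigKey& key, std::size_t i) const {
        const std::uint64_t a = (static_cast<std::uint64_t>(key.m) << 32) | key.n;
        const std::uint64_t b = (static_cast<std::uint64_t>(key.k) << 32) | key.policy;
        const std::uint64_t h1 = mix(a ^ mix(b));
        const std::uint64_t h2 = mix(h1) | 1ULL;  // odd step, never zero
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    std::size_t num_hashes_;
};
```

Under these assumptions, a selector would insert every (problem size, policy) pair found suitable during tuning and, at run time, skip any candidate for which `maybe_contains` returns `false`. Only the surviving candidates need a more expensive check, which is how a filter of this kind can discard most unsuitable configurations without ever discarding a suitable one.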