Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Kim, Seungchan; Kim, Jihoo; Ha, Sanghyun; You, Donghyun
Format:	Preprint
Published:	2025
Subjects:	Computational Physics
Online Access:	https://arxiv.org/abs/2509.03933

_version_	1866918135660019712
author	Kim, Seungchan Kim, Jihoo Ha, Sanghyun You, Donghyun
author_facet	Kim, Seungchan Kim, Jihoo Ha, Sanghyun You, Donghyun
contents	A tridiagonal matrix algorithm (TDMA), Pipelined-TDMA, is developed for multi-GPU systems to resolve the scalability bottlenecks caused by the sequential structure of conventional divide-and-conquer TDMA. The proposed method pipelines multiple tridiagonal systems, overlapping communication with computation and executing GPU kernels concurrently to hide non-scalable stages behind scalable compute stages. To maximize performance, the batch size is optimized to strike a balance between GPU occupancy and pipeline efficiency: larger batches improve throughput for solving tridiagonal systems, while excessively large batches reduce pipeline utilization. Performance evaluations on up to 64 NVIDIA A100 GPUs using a one-dimensional (1D) slab-type domain decomposition confirm that, except for the terminal phase of the pipeline, the proposed method successfully hides most of the non-scalable execution time, specifically inter-GPU communication and low-occupancy computation. The solver achieves ideal weak scaling up to 64 GPUs with one billion grid cells per GPU and reaches 74.7 percent of ideal performance in strong scaling tests for a 4-billion-cell problem, relative to a 4-GPU baseline. The optimized TDMA is integrated into an ADI-based fractional-step method to remove the scalability bottleneck in the Poisson solver of the flow solver (Ha et al., 2021). In a 9-billion-cell simulation on 64 GPUs, the TDMA component in the Poisson solver is accelerated by 4.37x, contributing to a 1.31x overall speedup of the complete flow solver.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_03933
institution	arXiv
publishDate	2025
record_format	arxiv
title	A Highly Scalable TDMA for GPUs and Its Application to Flow Solver Optimization
topic	Computational Physics
url	https://arxiv.org/abs/2509.03933
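
Note on the abstract: in its basic serial form, a TDMA is the Thomas algorithm, a forward elimination followed by a back substitution over a single tridiagonal system. The sketch below illustrates only that serial building block; it is not the Pipelined-TDMA described in the record and involves no GPUs, batching, or inter-device communication. The function name `thomas_solve` and its argument layout are illustrative assumptions, not taken from the preprint.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal system with the serial Thomas algorithm.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. All four lists have length n.
    Assumption: no pivoting is done, so the matrix should be diagonally
    dominant (or otherwise safe to eliminate without row swaps).
    """
    n = len(diag)
    c_prime = [0.0] * n  # modified super-diagonal
    d_prime = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal row by row.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution: recover the unknowns from the last row upward.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # 1D Poisson-like stencil (-1, 2, -1) on four unknowns.
    lower = [0.0, -1.0, -1.0, -1.0]
    diag = [2.0, 2.0, 2.0, 2.0]
    upper = [-1.0, -1.0, -1.0, 0.0]
    rhs = [0.0, 0.0, 0.0, 5.0]
    print(thomas_solve(lower, diag, upper, rhs))
```

Both loops carry a dependency from one row to the next, which is the sequential structure the abstract refers to; divide-and-conquer and pipelined variants exist to work around it on parallel hardware.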

Similar Items