Staff View: DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72 :: Library Catalog

Bibliographic Details
Main Authors:	Li, Wanqian; Peng, Jintao; Jing, Zongfei; Zhang, Tianyu; Long, Ze; Qiao, Xianjie; Chen, Xiaoming; Yang, Dongxu; Duan, Kefeng; Yang, June
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.01621

_version_	1866917486170996736
author	Li, Wanqian; Peng, Jintao; Jing, Zongfei; Zhang, Tianyu; Long, Ze; Qiao, Xianjie; Chen, Xiaoming; Yang, Dongxu; Duan, Kefeng; Yang, June
author_facet	Li, Wanqian; Peng, Jintao; Jing, Zongfei; Zhang, Tianyu; Long, Ze; Qiao, Xianjie; Chen, Xiaoming; Yang, Dongxu; Duan, Kefeng; Yang, June
contents	Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations for split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under 8K input sequence length and 1K output sequence length.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_01621
institution	arXiv
publishDate	2026
record_format	arxiv
title	DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72
topic	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
url	https://arxiv.org/abs/2604.01621

Similar Items