Bibliographic Details
Main Authors: Li, Wanqian, Peng, Jintao, Jing, Zongfei, Zhang, Tianyu, Long, Ze, Qiao, Xianjie, Chen, Xiaoming, Yang, Dongxu, Duan, Kefeng, Yang, June
Format: Preprint
Published: 2026
Subjects: Distributed, Parallel, and Cluster Computing; Artificial Intelligence
Online Access: https://arxiv.org/abs/2604.01621
author Li, Wanqian
Peng, Jintao
Jing, Zongfei
Zhang, Tianyu
Long, Ze
Qiao, Xianjie
Chen, Xiaoming
Yang, Dongxu
Duan, Kefeng
Yang, June
contents Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations for split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under 8K input sequence length and 1K output sequence length.
format Preprint
id arxiv_https___arxiv_org_abs_2604_01621
institution arXiv
publishDate 2026
record_format arxiv
title DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72
topic Distributed, Parallel, and Cluster Computing
Artificial Intelligence
url https://arxiv.org/abs/2604.01621
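The abstract describes each GPU keeping only part of the MoE weights, fetching missing experts from peers on demand, and prefetching remote weights asynchronously. The toy sketch below illustrates that idea only; it is not the authors' TensorRT-LLM implementation. All names (`Rank`, `prefetch`, `get_expert`, `run_demo`) are hypothetical, peer GPU memory is modeled as plain Python dictionaries, and a thread pool stands in for asynchronous copies.

```python
from concurrent.futures import ThreadPoolExecutor


class Rank:
    """Toy stand-in for one GPU that holds only a slice of the MoE experts."""

    def __init__(self, rank_id, local_experts, pool):
        self.rank_id = rank_id
        self.local = dict(local_experts)  # expert_id -> weights resident on this rank
        self.peers = []                   # other Rank objects (assumed directly readable)
        self.pending = {}                 # expert_id -> Future for an in-flight fetch
        self.pool = pool
        self.remote_fetches = 0

    def _fetch_from_peer(self, expert_id):
        # One-sided read: only the requesting rank does work, so no
        # collective synchronization with other ranks is modeled.
        for peer in self.peers:
            if expert_id in peer.local:
                return peer.local[expert_id]
        raise KeyError(expert_id)

    def prefetch(self, expert_ids):
        # Start asynchronous fetches for experts this rank does not hold.
        for eid in expert_ids:
            if eid not in self.local and eid not in self.pending:
                self.pending[eid] = self.pool.submit(self._fetch_from_peer, eid)
                self.remote_fetches += 1

    def get_expert(self, expert_id):
        # Local hit: no communication at all.
        if expert_id in self.local:
            return self.local[expert_id]
        # Miss: reuse an in-flight prefetch, or fetch on demand.
        if expert_id not in self.pending:
            self.prefetch([expert_id])
        return self.pending.pop(expert_id).result()


def run_demo():
    with ThreadPoolExecutor(max_workers=2) as pool:
        r0 = Rank(0, {0: "w0", 1: "w1"}, pool)
        r1 = Rank(1, {2: "w2", 3: "w3"}, pool)
        r0.peers, r1.peers = [r1], [r0]
        r0.prefetch([2, 3])  # issue remote reads before the experts are needed
        weights = [r0.get_expert(e) for e in (0, 2, 3)]
        return weights, r0.remote_fetches
```

In this sketch, rank 0 never waits on rank 1 to reach any particular point in its own work, which is the property the abstract attributes to removing collective inter-rank synchronization. Fetched weights are deliberately not cached, so memory use stays bounded by the local slice plus in-flight fetches; a real system would make its own choice there.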