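| Main Authors: | Li, Wanqian; Peng, Jintao; Jing, Zongfei; Zhang, Tianyu; Long, Ze; Qiao, Xianjie; Chen, Xiaoming; Yang, Dongxu; Duan, Kefeng; Yang, June |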
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Distributed, Parallel, and Cluster Computing; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2604.01621 |
| _version_ | 1866917486170996736 |
|---|---|
| author | Li, Wanqian; Peng, Jintao; Jing, Zongfei; Zhang, Tianyu; Long, Ze; Qiao, Xianjie; Chen, Xiaoming; Yang, Dongxu; Duan, Kefeng; Yang, June |
| contents | Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations for split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under 8K input sequence length and 1K output sequence length. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2604_01621 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72 |
| topic | Distributed, Parallel, and Cluster Computing; Artificial Intelligence |
| url | https://arxiv.org/abs/2604.01621 |