Bibliographic Details
Main Authors:	Tian, Jian, Li, Shuailong, Cao, Yang, Cui, Wenbo, Zhu, Minghan, Wu, Wenkang, Zhang, Jianming, Wang, Yanpeng, Xiao, Zhiwen, Hou, Zhenyu, Shen, Dou
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access:	https://arxiv.org/abs/2512.16134

Summary:

The evolution of Large Language Model (LLM) serving towards complex, distributed architectures, specifically the Prefill/Decode (P/D)-separated, large-scale data-parallel plus expert-parallel (DP+EP) paradigm, introduces distinct scheduling challenges. Unlike traditional deployments, where schedulers can treat instances as black boxes, DP+EP architectures exhibit high internal synchronization costs. We identify that immediate request dispatching in such systems leads to severe in-engine queuing and parallelization bubbles, degrading Time-to-First-Token (TTFT). To address this, we propose Staggered Batch Scheduling (SBS), a mechanism that deliberately buffers requests to form optimal execution batches. This temporal decoupling eliminates internal queuing bubbles without compromising throughput. Furthermore, leveraging the scheduling window created by buffering, we introduce a Load-Aware Global Allocation strategy that balances computational load across DP units for both Prefill and Decode phases. Deployed on a production H800 cluster serving DeepSeek-V3, our system reduces TTFT by 30%-40% and improves throughput by 15%-20% compared to state-of-the-art immediate scheduling baselines.
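
The two ideas in the summary, buffering requests for a short window and then spreading the buffered batch across DP units by load, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Request`, `StaggeredBuffer`, `allocate_load_aware`, and `window_s` are hypothetical, prompt length is assumed as a stand-in for prefill cost, and the allocation uses a standard greedy longest-first heuristic rather than whatever strategy the paper actually uses.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Request:
    req_id: str
    prompt_tokens: int  # assumed proxy for prefill cost


def allocate_load_aware(requests: List[Request], num_dp_units: int) -> List[List[Request]]:
    """Greedy heuristic: place the largest requests first on the least-loaded unit."""
    assignment: List[List[Request]] = [[] for _ in range(num_dp_units)]
    heap = [(0, unit) for unit in range(num_dp_units)]  # (current load, unit index)
    heapq.heapify(heap)
    for req in sorted(requests, key=lambda r: r.prompt_tokens, reverse=True):
        load, unit = heapq.heappop(heap)
        assignment[unit].append(req)
        heapq.heappush(heap, (load + req.prompt_tokens, unit))
    return assignment


@dataclass
class StaggeredBuffer:
    num_dp_units: int
    window_s: float  # hypothetical buffering window, in seconds
    _pending: List[Request] = field(default_factory=list)
    _opened_at: Optional[float] = None

    def add(self, req: Request, now: float) -> None:
        """Buffer a request instead of dispatching it immediately."""
        if self._opened_at is None:
            self._opened_at = now  # the window opens with the first buffered request
        self._pending.append(req)

    def flush_if_due(self, now: float) -> Optional[List[List[Request]]]:
        """Once the window has elapsed, release one batch split across DP units."""
        if self._opened_at is None or now - self._opened_at < self.window_s:
            return None
        batch, self._pending, self._opened_at = self._pending, [], None
        return allocate_load_aware(batch, self.num_dp_units)
```

A caller would pass its own clock value as `now` (for example from `time.monotonic()`), call `add` on arrival, and poll `flush_if_due` to obtain per-unit request lists. Choosing the window length is the key trade-off: a longer window gives the allocator more requests to balance, but adds waiting time before dispatch.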