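Staff View: TENT: A Declarative Slice Spraying Engine for Performant and Resilient Data Movement in Disaggregated LLM Serving :: Library Catalog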

Bibliographic Details
Main Authors:	Ren, Feng; Qin, Ruoyu; Ma, Teng; Cai, Shangming; Liu, Zheng; Lei, Chao; Zhu, Dejiang; Yang, Ke; Li, Zheming; Cui, Jialei; Huang, Weixiao; Zhao, Yikai; Zhang, Yineng; Wu, Hao; Gao, Xiang; Fu, Yuhao; Jiang, Jinlei; Wu, Yongwei; Zhang, Mingxing
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2604.00368

_version_	1866915904395149312
author	Ren, Feng; Qin, Ruoyu; Ma, Teng; Cai, Shangming; Liu, Zheng; Lei, Chao; Zhu, Dejiang; Yang, Ke; Li, Zheming; Cui, Jialei; Huang, Weixiao; Zhao, Yikai; Zhang, Yineng; Wu, Hao; Gao, Xiang; Fu, Yuhao; Jiang, Jinlei; Wu, Yongwei; Zhang, Mingxing
author_facet	Ren, Feng; Qin, Ruoyu; Ma, Teng; Cai, Shangming; Liu, Zheng; Lei, Chao; Zhu, Dejiang; Yang, Ke; Li, Zheming; Cui, Jialei; Huang, Weixiao; Zhao, Yikai; Zhang, Yineng; Wu, Hao; Gao, Xiang; Fu, Yuhao; Jiang, Jinlei; Wu, Yongwei; Zhang, Mingxing
contents	Modern GPU clusters are built upon a complex hierarchy of heterogeneous interconnects, ranging from multi-rail RDMA to proprietary fabrics such as Multi-Node NVLink and Ascend UB. Orchestrating these diverse links effectively remains a critical challenge in disaggregated LLM serving. Operating Mooncake TE on thousands of GPUs exposed a critical limitation shared by existing frameworks: imperative, statically bound path selection. This rigidity forces engines to rely on state-blind striping that ignores congestion signals, creating communication silos, wasting multi-rail bandwidth due to head-of-line blocking, and leading to operational fragility where routine faults require manual intervention. We present TENT, a data-movement engine that decouples transfer intent from physical execution. Instead of locking workloads to fixed backends, TENT unifies heterogeneous interconnects into a single dynamic resource pool. Applications simply declare transfer intents, while TENT dynamically decomposes elephant flows into fine-grained slices and "sprays" them across links based on instantaneous link quality. This telemetry-driven orchestration eliminates head-of-line blocking and enables transparent, sub-50 ms self-healing by rerouting slices around failures without application logic. TENT serves as the production data plane for LLM inference and RL pipelines at multiple industrial sites. Our evaluation on H800 HGX clusters shows that TENT outperforms state-of-the-art baselines, including Mooncake TE, NIXL, and UCCL. In LLM inference with SGLang HiCache, TENT achieves up to 1.36x higher throughput and 26% lower P90 TTFT than Mooncake TE. In RL pipelines, TENT accelerates parameter updates in Moonshot Checkpoint Engine by 20-26%.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_00368
institution	arXiv
publishDate	2026
record_format	arxiv
title	TENT: A Declarative Slice Spraying Engine for Performant and Resilient Data Movement in Disaggregated LLM Serving
topic	Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2604.00368
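
Illustrative note: the abstract in the contents field describes splitting large transfers into fine-grained slices and spraying them across links according to link quality, then rerouting slices around failures. The Python sketch below is one possible reading of that idea, not TENT's implementation. Every name in it (`Link`, `spray`, `reroute`, the `bandwidth` and `queued` fields) is hypothetical, and the greedy "lowest estimated completion time" rule is an assumption standing in for whatever telemetry the real engine uses.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """Hypothetical model of one interconnect path (e.g. one RDMA rail)."""
    name: str
    bandwidth: float              # assumed telemetry value, bytes per second
    queued: int = 0               # bytes already assigned to this link
    healthy: bool = True
    slices: list = field(default_factory=list)  # (offset, size) pairs

    def eta(self, size: int) -> float:
        # Estimated time to drain the current queue plus one more slice.
        return (self.queued + size) / self.bandwidth


def _place(offset: int, size: int, links: list) -> None:
    """Put one slice on the healthy link with the lowest estimated ETA."""
    candidates = [link for link in links if link.healthy]
    if not candidates:
        raise RuntimeError("no healthy link available")
    best = min(candidates, key=lambda link: link.eta(size))
    best.slices.append((offset, size))
    best.queued += size


def spray(total_bytes: int, slice_bytes: int, links: list) -> None:
    """Split a transfer into slices and place them one by one."""
    offset = 0
    while offset < total_bytes:
        size = min(slice_bytes, total_bytes - offset)
        _place(offset, size, links)
        offset += size


def reroute(failed: Link, links: list) -> None:
    """Mark a link as failed and re-place its pending slices elsewhere."""
    failed.healthy = False
    pending, failed.slices, failed.queued = failed.slices, [], 0
    for offset, size in pending:
        _place(offset, size, links)


if __name__ == "__main__":
    rails = [Link("rail0", bandwidth=2.0), Link("rail1", bandwidth=1.0)]
    spray(total_bytes=12, slice_bytes=1, links=rails)
    print({link.name: link.queued for link in rails})
    reroute(rails[0], rails)
    print({link.name: link.queued for link in rails})
```

In this toy model the faster link ends up carrying proportionally more slices, and a failed link's slices are redistributed without the caller changing its request, which is the behavior the abstract attributes to declaring intent rather than binding to a fixed path.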

Similar Items