Bibliographic Details
Main Authors: Xiao, Ao, He, Bangzheng, Zhang, Baoquan, Huai, Baoxing, Wang, Bingji, Wang, Bo, Xu, Bo, Hou, Boyi, Yang, Chan, Liu, Changhong, Cui, Cheng, Zhu, Chenyu, Feng, Cong, Wang, Daohui, Lin, Dayun, Zhao, Duo, Zou, Fengshao, Wang, Fu, Zhang, Gangqiang, Dan, Gengyuan, Chen, Guanjie, Guan, Guodong, Yang, Guodong, Li, Haifeng, Zhu, Haipei, Li, Haley, Feng, Hao, Huang, Hao, Xu, Hao, Ma, Hengrui, Fan, Hengtao, Liu, Hui, Li, Jia, Liu, Jiang, Xu, Jiang, Meng, Jie, Xin, Jinhan, Hu, Junhao, Chen, Juwei, Yu, Lan, Miao, Lanxin, Liu, Liang, Jing, Linan, Zhou, Lu, Han, Meina, Deng, Mingkun, Deng, Mingyu, Deng, Naitian, Lin, Nizhong, Zhao, Peihan, Pan, Peng, Shen, Pengfei, Li, Ping, Zhang, Qi, Wang, Qian, Xia, Qin ZhC Qingrong, Zhang, Qingyi, Fu, Qunchao, Guo, Ren, Gao, Ruimin, Li, Shaochun, Long, Sheng, Li, Shentian, Wan, Shining, Shen, Shuai, Zeng, Shuangfu, Jing, Shuming, Yang, Siqi, Zhang, Song, Xu, Tao, Du, Tianlin, Chen, Ting, Wu, Wanxu, Jiang, Wei, Tong, Weinan, Chen, Weiwei, Peng, Wen, Zhou, Wenli, Yang, Wenquan, Liang, Wenxin, Liu, Xiang, Zhou, Xiaoli, Jin, Xin, Duan, Xinyu, Li, Xu, Zhang, Xu, Chen, Xusheng, Shan, Yalong, Gan, Yang, Lu, Yao, Deng, Yi, Zheng, Yi, Xiong, Ying, Zheng, Yingfei, Zheng, Yiyun, Shan, Yizhou, Gao, Yong, Zhang, Yong, Yang, Yongqiang, Gong, Yuanjin, Yu, Yue, Chen, Yuetao, Zhu, Yukun, He, Yulong, Zhao, Yusu, Wu, Yuyan, Zhang, Zenan, Zhuo, Zhaojin, Ji, Zhaoyang, Wang, Zhefeng, Wang, Zheng, Fan, Zhenan, Yang, Zhenhua, Sheng, Zhenli, Yu, Zhibin, Ji, Zhigang, Ren, Zhihao, Bian, Zhipeng, Liu, Zhixia, Dong, Zhiyu, Li, Zhonghua, Yu, Zhou, Shen, Zhuoming, Peng, Zhuwei, Ye, Zi, Xiang, Zihao, Fu, Zimin, Zhang, Zixuan
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2508.02520
Summary:
  • Scaled-out MoE LLMs and scaled-up SuperPods create new systems challenges for production Model-as-a-Service (MaaS), requiring disaggregation, low-latency communication, and decentralized serving. This report presents xDeepServe, the production serving system behind Huawei Cloud's MaaS offering on CloudMatrix384, a 48-server SuperPod with 384 Ascend 910C chips connected by a high-bandwidth UB fabric and global shared memory. It serves models including DeepSeek, Kimi, GLM, Qwen, and MiniMax, among others. xDeepServe is built around Transformerless, a disaggregated execution architecture that decomposes transformer inference into modular units -- attention, feedforward, and MoE -- and supports disaggregated Prefill-Decode and MoE-Attention deployments. To enable disaggregation, we develop XCCL, a memory-semantic communication layer providing microsecond-level point-to-point and scalable all-to-all primitives, and we extend FlowServe with decentralized DP groups and techniques to mitigate stragglers and synchronization variance. In a peak decoding configuration, xDeepServe reaches 2400 tokens/s per Ascend 910C chip at ~50ms time-per-output-token (TPOT).