Bibliographic Details
Main Authors: Xiao, Ao, He, Bangzheng, Zhang, Baoquan, Huai, Baoxing, Wang, Bingji, Wang, Bo, Xu, Bo, Hou, Boyi, Yang, Chan, Liu, Changhong, Cui, Cheng, Zhu, Chenyu, Feng, Cong, Wang, Daohui, Lin, Dayun, Zhao, Duo, Zou, Fengshao, Wang, Fu, Zhang, Gangqiang, Dan, Gengyuan, Chen, Guanjie, Guan, Guodong, Yang, Guodong, Li, Haifeng, Zhu, Haipei, Li, Haley, Feng, Hao, Huang, Hao, Xu, Hao, Ma, Hengrui, Fan, Hengtao, Liu, Hui, Li, Jia, Liu, Jiang, Xu, Jiang, Meng, Jie, Xin, Jinhan, Hu, Junhao, Chen, Juwei, Yu, Lan, Miao, Lanxin, Liu, Liang, Jing, Linan, Zhou, Lu, Han, Meina, Deng, Mingkun, Deng, Mingyu, Deng, Naitian, Lin, Nizhong, Zhao, Peihan, Pan, Peng, Shen, Pengfei, Li, Ping, Zhang, Qi, Wang, Qian, Xia, Qin ZhC Qingrong, Zhang, Qingyi, Fu, Qunchao, Guo, Ren, Gao, Ruimin, Li, Shaochun, Long, Sheng, Li, Shentian, Wan, Shining, Shen, Shuai, Zeng, Shuangfu, Jing, Shuming, Yang, Siqi, Zhang, Song, Xu, Tao, Du, Tianlin, Chen, Ting, Wu, Wanxu, Jiang, Wei, Tong, Weinan, Chen, Weiwei, Peng, Wen, Zhou, Wenli, Yang, Wenquan, Liang, Wenxin, Liu, Xiang, Zhou, Xiaoli, Jin, Xin, Duan, Xinyu, Li, Xu, Zhang, Xu, Chen, Xusheng, Shan, Yalong, Gan, Yang, Lu, Yao, Deng, Yi, Zheng, Yi, Xiong, Ying, Zheng, Yingfei, Zheng, Yiyun, Shan, Yizhou, Gao, Yong, Zhang, Yong, Yang, Yongqiang, Gong, Yuanjin, Yu, Yue, Chen, Yuetao, Zhu, Yukun, He, Yulong, Zhao, Yusu, Wu, Yuyan, Zhang, Zenan, Zhuo, Zhaojin, Ji, Zhaoyang, Wang, Zhefeng, Wang, Zheng, Fan, Zhenan, Yang, Zhenhua, Sheng, Zhenli, Yu, Zhibin, Ji, Zhigang, Ren, Zhihao, Bian, Zhipeng, Liu, Zhixia, Dong, Zhiyu, Li, Zhonghua, Yu, Zhou, Shen, Zhuoming, Peng, Zhuwei, Ye, Zi, Xiang, Zihao, Fu, Zimin, Zhang, Zixuan
Format: Preprint
Published: 2025
Subjects: Distributed, Parallel, and Cluster Computing
Online Access: https://arxiv.org/abs/2508.02520
contents Scaled-out MoE LLMs and scaled-up SuperPods create new systems challenges for production Model-as-a-Service (MaaS), requiring disaggregation, low-latency communication, and decentralized serving. This report presents xDeepServe, the production serving system behind Huawei Cloud's MaaS offering on CloudMatrix384, a 48-server SuperPod with 384 Ascend 910C chips connected by a high-bandwidth UB fabric and global shared memory. It serves models including DeepSeek, Kimi, GLM, Qwen, and MiniMax, among others. xDeepServe is built around Transformerless, a disaggregated execution architecture that decomposes transformer inference into modular units -- attention, feedforward, and MoE -- and supports disaggregated Prefill-Decode and MoE-Attention deployments. To enable disaggregation, we develop XCCL, a memory-semantic communication layer providing microsecond-level point-to-point and scalable all-to-all primitives, and we extend FlowServe with decentralized DP groups and techniques to mitigate stragglers and synchronization variance. In a peak decoding configuration, xDeepServe reaches 2400 tokens/s per Ascend 910C chip at ~50ms time-per-output-token (TPOT).
format Preprint
id arxiv_https___arxiv_org_abs_2508_02520
institution arXiv
publishDate 2025
record_format arxiv
title Huawei Cloud Model-as-a-Service on the CloudMatrix384 SuperPod
topic Distributed, Parallel, and Cluster Computing
url https://arxiv.org/abs/2508.02520
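The abstract in this record quotes a few headline figures (48 servers, 384 chips, 2400 tokens/s per chip, ~50ms TPOT). The minimal sketch below only works through the arithmetic those numbers imply. It is not taken from the report: the function names are hypothetical, "~50ms" is treated as exactly 50 ms, and the concurrency estimate assumes each active sequence emits exactly one token per TPOT interval, which would not hold if the system emits several tokens per step.

```python
# Back-of-the-envelope arithmetic on figures quoted in the abstract.
# All function names are illustrative; nothing here comes from xDeepServe itself.

SERVERS = 48                    # "a 48-server SuperPod"
CHIPS = 384                     # "384 Ascend 910C chips"
TOKENS_PER_SEC_PER_CHIP = 2400  # peak decoding configuration
TPOT_SECONDS = 0.050            # "~50ms", assumed to be exactly 50 ms


def chips_per_server(chips: int, servers: int) -> int:
    """Chips per server, assuming chips are spread evenly across servers."""
    if chips % servers != 0:
        raise ValueError("chips do not divide evenly across servers")
    return chips // servers


def implied_concurrent_sequences(tokens_per_sec: float, tpot_seconds: float) -> float:
    """Sequences a chip must decode concurrently to sustain the throughput.

    Assumption: every active sequence yields one token per TPOT interval,
    so throughput = concurrency / TPOT.
    """
    return tokens_per_sec * tpot_seconds


if __name__ == "__main__":
    print("chips per server:", chips_per_server(CHIPS, SERVERS))
    concurrency = implied_concurrent_sequences(TOKENS_PER_SEC_PER_CHIP, TPOT_SECONDS)
    print("implied concurrent sequences per chip:", round(concurrency))
```

Under these assumptions the quoted throughput corresponds to on the order of a hundred concurrently decoding sequences per chip; the report itself should be consulted for the actual batch sizes and deployment layout.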