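| Main Authors: | Xiao, Ao; He, Bangzheng; Zhang, Baoquan; Huai, Baoxing; Wang, Bingji; Wang, Bo; Xu, Bo; Hou, Boyi; Yang, Chan; Liu, Changhong; Cui, Cheng; Zhu, Chenyu; Feng, Cong; Wang, Daohui; Lin, Dayun; Zhao, Duo; Zou, Fengshao; Wang, Fu; Zhang, Gangqiang; Dan, Gengyuan; Chen, Guanjie; Guan, Guodong; Yang, Guodong; Li, Haifeng; Zhu, Haipei; Li, Haley; Feng, Hao; Huang, Hao; Xu, Hao; Ma, Hengrui; Fan, Hengtao; Liu, Hui; Li, Jia; Liu, Jiang; Xu, Jiang; Meng, Jie; Xin, Jinhan; Hu, Junhao; Chen, Juwei; Yu, Lan; Miao, Lanxin; Liu, Liang; Jing, Linan; Zhou, Lu; Han, Meina; Deng, Mingkun; Deng, Mingyu; Deng, Naitian; Lin, Nizhong; Zhao, Peihan; Pan, Peng; Shen, Pengfei; Li, Ping; Zhang, Qi; Wang, Qian; Xia, Qin; ZhC Qingrong; Zhang, Qingyi; Fu, Qunchao; Guo, Ren; Gao, Ruimin; Li, Shaochun; Long, Sheng; Li, Shentian; Wan, Shining; Shen, Shuai; Zeng, Shuangfu; Jing, Shuming; Yang, Siqi; Zhang, Song; Xu, Tao; Du, Tianlin; Chen, Ting; Wu, Wanxu; Jiang, Wei; Tong, Weinan; Chen, Weiwei; Peng, Wen; Zhou, Wenli; Yang, Wenquan; Liang, Wenxin; Liu, Xiang; Zhou, Xiaoli; Jin, Xin; Duan, Xinyu; Li, Xu; Zhang, Xu; Chen, Xusheng; Shan, Yalong; Gan, Yang; Lu, Yao; Deng, Yi; Zheng, Yi; Xiong, Ying; Zheng, Yingfei; Zheng, Yiyun; Shan, Yizhou; Gao, Yong; Zhang, Yong; Yang, Yongqiang; Gong, Yuanjin; Yu, Yue; Chen, Yuetao; Zhu, Yukun; He, Yulong; Zhao, Yusu; Wu, Yuyan; Zhang, Zenan; Zhuo, Zhaojin; Ji, Zhaoyang; Wang, Zhefeng; Wang, Zheng; Fan, Zhenan; Yang, Zhenhua; Sheng, Zhenli; Yu, Zhibin; Ji, Zhigang; Ren, Zhihao; Bian, Zhipeng; Liu, Zhixia; Dong, Zhiyu; Li, Zhonghua; Yu, Zhou; Shen, Zhuoming; Peng, Zhuwei; Ye, Zi; Xiang, Zihao; Fu, Zimin; Zhang, Zixuan |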
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Distributed, Parallel, and Cluster Computing |
| Online Access: | https://arxiv.org/abs/2508.02520 |
| _version_ | 1866911474974195712 |
|---|---|
| author | Xiao, Ao; He, Bangzheng; Zhang, Baoquan; Huai, Baoxing; Wang, Bingji; Wang, Bo; Xu, Bo; Hou, Boyi; Yang, Chan; Liu, Changhong; Cui, Cheng; Zhu, Chenyu; Feng, Cong; Wang, Daohui; Lin, Dayun; Zhao, Duo; Zou, Fengshao; Wang, Fu; Zhang, Gangqiang; Dan, Gengyuan; Chen, Guanjie; Guan, Guodong; Yang, Guodong; Li, Haifeng; Zhu, Haipei; Li, Haley; Feng, Hao; Huang, Hao; Xu, Hao; Ma, Hengrui; Fan, Hengtao; Liu, Hui; Li, Jia; Liu, Jiang; Xu, Jiang; Meng, Jie; Xin, Jinhan; Hu, Junhao; Chen, Juwei; Yu, Lan; Miao, Lanxin; Liu, Liang; Jing, Linan; Zhou, Lu; Han, Meina; Deng, Mingkun; Deng, Mingyu; Deng, Naitian; Lin, Nizhong; Zhao, Peihan; Pan, Peng; Shen, Pengfei; Li, Ping; Zhang, Qi; Wang, Qian; Xia, Qin; ZhC Qingrong; Zhang, Qingyi; Fu, Qunchao; Guo, Ren; Gao, Ruimin; Li, Shaochun; Long, Sheng; Li, Shentian; Wan, Shining; Shen, Shuai; Zeng, Shuangfu; Jing, Shuming; Yang, Siqi; Zhang, Song; Xu, Tao; Du, Tianlin; Chen, Ting; Wu, Wanxu; Jiang, Wei; Tong, Weinan; Chen, Weiwei; Peng, Wen; Zhou, Wenli; Yang, Wenquan; Liang, Wenxin; Liu, Xiang; Zhou, Xiaoli; Jin, Xin; Duan, Xinyu; Li, Xu; Zhang, Xu; Chen, Xusheng; Shan, Yalong; Gan, Yang; Lu, Yao; Deng, Yi; Zheng, Yi; Xiong, Ying; Zheng, Yingfei; Zheng, Yiyun; Shan, Yizhou; Gao, Yong; Zhang, Yong; Yang, Yongqiang; Gong, Yuanjin; Yu, Yue; Chen, Yuetao; Zhu, Yukun; He, Yulong; Zhao, Yusu; Wu, Yuyan; Zhang, Zenan; Zhuo, Zhaojin; Ji, Zhaoyang; Wang, Zhefeng; Wang, Zheng; Fan, Zhenan; Yang, Zhenhua; Sheng, Zhenli; Yu, Zhibin; Ji, Zhigang; Ren, Zhihao; Bian, Zhipeng; Liu, Zhixia; Dong, Zhiyu; Li, Zhonghua; Yu, Zhou; Shen, Zhuoming; Peng, Zhuwei; Ye, Zi; Xiang, Zihao; Fu, Zimin; Zhang, Zixuan |
| contents | Scaled-out MoE LLMs and scaled-up SuperPods create new systems challenges for production Model-as-a-Service (MaaS), requiring disaggregation, low-latency communication, and decentralized serving. This report presents xDeepServe, the production serving system behind Huawei Cloud's MaaS offering on CloudMatrix384, a 48-server SuperPod with 384 Ascend 910C chips connected by a high-bandwidth UB fabric and global shared memory. It serves models including DeepSeek, Kimi, GLM, Qwen, and MiniMax, among others. xDeepServe is built around Transformerless, a disaggregated execution architecture that decomposes transformer inference into modular units -- attention, feedforward, and MoE -- and supports disaggregated Prefill-Decode and MoE-Attention deployments. To enable disaggregation, we develop XCCL, a memory-semantic communication layer providing microsecond-level point-to-point and scalable all-to-all primitives, and we extend FlowServe with decentralized DP groups and techniques to mitigate stragglers and synchronization variance. In a peak decoding configuration, xDeepServe reaches 2400 tokens/s per Ascend 910C chip at ~50ms time-per-output-token (TPOT). |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2508_02520 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Huawei Cloud Model-as-a-Service on the CloudMatrix384 SuperPod |
| topic | Distributed, Parallel, and Cluster Computing |
| url | https://arxiv.org/abs/2508.02520 |
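
The figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not taken from the preprint: the function names are hypothetical, the chips-per-server value is derived from the stated 384 chips across 48 servers, and the concurrency estimate assumes that every in-flight sequence on a chip emits exactly one token per TPOT interval, which the record does not state.

```python
def chips_per_server(total_chips: int, servers: int) -> int:
    """Chips per server, assuming chips are spread evenly (an assumption)."""
    if servers <= 0 or total_chips % servers != 0:
        raise ValueError("chips must divide evenly across a positive server count")
    return total_chips // servers


def implied_concurrency(tokens_per_s: float, tpot_ms: float) -> float:
    """Sequences decoded concurrently per chip.

    Assumes each in-flight sequence yields one token every TPOT interval,
    so throughput = concurrency / TPOT, i.e. concurrency = throughput * TPOT.
    """
    return tokens_per_s * tpot_ms / 1000.0


# Values quoted in the abstract: 384 chips, 48 servers,
# 2400 tokens/s per chip at roughly 50 ms TPOT.
print(chips_per_server(384, 48))        # 8
print(implied_concurrency(2400, 50))    # 120.0
```

Under these assumptions, the peak decoding figure corresponds to roughly 120 sequences in flight per chip; a different batching or speculative-decoding scheme would change that estimate.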