:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Zezhou; Li, Youjie; Lin, Zhiqi; Yang, Jiacheng; Xie, Cong; Feng, Guanyu; Zhong, Zheng; Huang, Ziyue; Zhu, Hongyu; Zhang, Zhi; Peng, Yanghua; Liu, Xin
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2602.22437

Similar Items

veScale: Consistent and Efficient Tensor Programming with Eager-Mode SPMD
by: Li, Youjie, et al.
Published: (2025)

SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile
by: Zhang, Ruisi, et al.
Published: (2024)

Performance Characterization of Distributed Deep Learning Strategies: A Quantitative Evaluation of DDP, FSDP, and Parameter Server Architectures on GPU Clusters
by: Ovi, Md Sultanul Islam
Published: (2025)

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
by: Ma, Qianli, et al.
Published: (2025)

MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
by: Jin, Chao, et al.
Published: (2025)

MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
by: Xue, Chunyu, et al.
Published: (2026)

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
by: Jiang, Ziheng, et al.
Published: (2024)

Exploring Uncore Frequency Scaling for Heterogeneous Computing
by: Zheng, Zhong, et al.
Published: (2025)

Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
by: Feng, Weiqi, et al.
Published: (2024)

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
by: Zhao, Juntao, et al.
Published: (2025)

FlexKV: Flexible Index Offloading for Memory-Disaggregated Key-Value Store
by: Hu, Zhisheng, et al.
Published: (2025)

PRISM: Dynamic Primitive-Based Forecasting for Large-Scale GPU Cluster Workloads
by: Wu, Xin, et al.
Published: (2026)

Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine
by: Zhang, Zuoning, et al.
Published: (2024)

Comparing Cross-Platform Performance via Node-to-Node Scaling Studies
by: Weiss, Kenneth, et al.
Published: (2025)

Data Caching for Enterprise-Grade Petabyte-Scale OLAP
by: Tang, Chunxu, et al.
Published: (2024)

EdgeVision: Towards Collaborative Video Analytics on Distributed Edges for Performance Maximization
by: Gao, Guanyu, et al.
Published: (2022)

EcoShift: Performance-Aware Power Management for Power-Constrained Heterogeneous Systems
by: Zheng, Zhong, et al.
Published: (2026)

madupite: A High-Performance Distributed Solver for Large-Scale Markov Decision Processes
by: Gargiani, Matilde, et al.
Published: (2025)

PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training
by: Golden, Alicia, et al.
Published: (2025)

Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters
by: Zhang, WenZheng, et al.
Published: (2024)

StatuScale: Status-aware and Elastic Scaling Strategy for Microservice Applications
by: Wen, Linfeng, et al.
Published: (2024)

DeepServe: Serverless Large Language Model Serving at Scale
by: Hu, Junhao, et al.
Published: (2025)

PolarStore: High-Performance Data Compression for Large-Scale Cloud-Native Databases
by: Hu, Qingda, et al.
Published: (2025)

ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
by: Li, Haley, et al.
Published: (2026)

λScale: Enabling Fast Scaling for Serverless Large Language Model Inference
by: Yu, Minchen, et al.
Published: (2025)

M$^2$-MFP: A Multi-Scale and Multi-Level Memory Failure Prediction Framework for Reliable Cloud Infrastructure
by: Xie, Hongyi, et al.
Published: (2025)

Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale
by: Mittal, Kashish, et al.
Published: (2026)

A Tale of Two Scales: Reconciling Horizontal and Vertical Scaling for Inference Serving Systems
by: Razavi, Kamran, et al.
Published: (2024)

Case Study: Performance Analysis of a Virtualized XRootD Frontend in Large-Scale WAN Transfers
by: da Silva, J M, et al.
Published: (2026)

Deep Learning-Enabled Supercritical Flame Simulation at Detailed Chemistry and Real-Fluid Accuracy Towards Trillion-Cell Scale
by: Guo, Zhuoqiang, et al.
Published: (2025)

FAIR Ecosystems for Science at Scale
by: Wilkinson, Sean R., et al.
Published: (2025)

Scaling MPI Applications on Aurora
by: Ibeid, Huda, et al.
Published: (2025)

Steering a Fleet: Adaptation for Large-Scale, Workflow-Based Experiments
by: Pruyne, Jim, et al.
Published: (2024)

MPI-Q: A Message Communication Library for Large-Scale Classical-Quantum Heterogeneous Hybrid Distributed Computing
by: Wang, Feng, et al.
Published: (2026)

Scaling Real-Time Traffic Analytics on Edge-Cloud Fabrics for City-Scale Camera Networks
by: Sharma, Akash, et al.
Published: (2026)

SDSL-Solver: Scalable Distributed Sparse Linear Solvers for Large-Scale Interior Point Methods
by: Yang, Shaofeng, et al.
Published: (2026)

Barycentric Coded Distributed Computing with Flexible Recovery Threshold for Collaborative Mobile Edge Computing
by: Qiu, Houming, et al.
Published: (2025)

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
by: Zhu, Ruidong, et al.
Published: (2025)

ScalePool: Hybrid XLink-CXL Fabric for Composable Resource Disaggregation in Unified Scale-up Domains
by: Woo, Hyein, et al.
Published: (2025)

Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles
by: Arif, Moiz, et al.
Published: (2026)