:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Zhuang; Xu, Zhaozhuo; Xi, Jingyi; Wang, Yuke; Shrivastava, Anshumali; Ng, T. S. Eugene
Format:	Preprint
Published:	2023
Subjects:	Machine Learning; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2309.13254

Similar Items

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)
by: Zhang, Tianyi, et al.
Published: (2025)

QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices
by: Zhao, Juntao, et al.
Published: (2024)

ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
by: Zuo, Jingwei, et al.
Published: (2026)

RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training
by: Gao, Wei, et al.
Published: (2025)

BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training
by: Sun, Ting, et al.
Published: (2026)

Echo: Simulating Distributed Training At Scale
by: Feng, Yicheng, et al.
Published: (2024)

Agglomerative Federated Learning: Empowering Larger Model Training via End-Edge-Cloud Collaboration
by: Wu, Zhiyuan, et al.
Published: (2023)

Empowering Data Mesh with Federated Learning
by: Li, Haoyuan, et al.
Published: (2024)

CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers
by: Bian, Jieming, et al.
Published: (2023)

Faster Distributed Inference-Only Recommender Systems via Bounded Lag Synchronous Collectives
by: Dichev, Kiril, et al.
Published: (2025)

Efficient Data Distribution Estimation for Accelerated Federated Learning
by: Wang, Yuanli, et al.
Published: (2024)

Incentivizing Permissionless Distributed Learning of LLMs
by: Lidin, Joel, et al.
Published: (2025)

Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization
by: Wang, Chong, et al.
Published: (2026)

Robust Fully-Asynchronous Methods for Distributed Training over General Architecture
by: Zhu, Zehan, et al.
Published: (2023)

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency
by: Yao, Yuhang, et al.
Published: (2024)

Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
by: Go, Seokjin, et al.
Published: (2025)

SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
by: Du, Zhixu, et al.
Published: (2023)

Accelerating Distributed ML Training via Selective Synchronization
by: Tyagi, Sahil, et al.
Published: (2023)

Distributed Training under Packet Loss
by: Weintraub, Erez, et al.
Published: (2025)

Empowering Federated Learning for Massive Models with NVIDIA FLARE
by: Roth, Holger R., et al.
Published: (2024)

Minder: Faulty Machine Detection for Large-scale Distributed Model Training
by: Deng, Yangtao, et al.
Published: (2024)

Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling
by: Wang, Yujie, et al.
Published: (2024)

AntDT: A Self-Adaptive Distributed Training Framework for Leader and Straggler Nodes
by: Xiao, Youshao, et al.
Published: (2024)

Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
by: Qin, Ruoyu, et al.
Published: (2025)

Comprehensive Evaluation of GNN Training Systems: A Data Management Perspective
by: Yuan, Hao, et al.
Published: (2023)

Resource Efficient Asynchronous Federated Learning for Digital Twin Empowered IoT Network
by: Chu, Shunfeng, et al.
Published: (2024)

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities
by: Wei, Yunze, et al.
Published: (2024)

Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models
by: Wu, Yongji, et al.
Published: (2024)

Hyperdimensional Computing Empowered Federated Foundation Model over Wireless Networks for Metaverse
by: Ding, Yahao, et al.
Published: (2024)

Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment
by: Li, Haoyang, et al.
Published: (2024)

Efficient Parallelization Layouts for Large-Scale Distributed Model Training
by: Hagemann, Johannes, et al.
Published: (2023)

Distributed Convolutional Neural Network Training on Mobile and Edge Clusters
by: Rama, Pranav, et al.
Published: (2024)

Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training
by: Deng, Yangtao, et al.
Published: (2025)

DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction
by: Cai, Weilin, et al.
Published: (2025)

Unicron: Economizing Self-Healing LLM Training at Scale
by: He, Tao, et al.
Published: (2023)

Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet
by: Lidin, Joel, et al.
Published: (2026)

An Experimental Comparison of Partitioning Strategies for Distributed Graph Neural Network Training
by: Merkel, Nikolai, et al.
Published: (2023)

SparDL: Distributed Deep Learning Training with Efficient Sparse Communication
by: Zhao, Minjun, et al.
Published: (2023)

Understanding Silent Data Corruption in LLM Training
by: Ma, Jeffrey, et al.
Published: (2025)

Fully Distributed Online Training of Graph Neural Networks in Networked Systems
by: Olshevskyi, Rostyslav, et al.
Published: (2024)