:: Library Catalog

Bibliographic Details
Main Authors:	Chen, Ping; Zhang, Wenjie; He, Shuibing; Chen, Weijian; Yang, Siling; Huang, Kexin; Yin, Yanlong; Zhan, Xuan; Gu, Yingjie; Peng, Zhuwei; Zheng, Yi; Wang, Zhefeng; Chen, Gang
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access:	https://arxiv.org/abs/2406.08756

Similar Items

Adacc: An Adaptive Framework Unifying Compression and Activation Recomputation for LLM Training
by: Chen, Ping, et al.
Published: (2025)

HopGNN: Boosting Distributed GNN Training Efficiency via Feature-Centric Model Migration
by: Chen, Weijian, et al.
Published: (2024)

Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters
by: Chang, Zihan, et al.
Published: (2024)

Heimdall++: Optimizing GPU Utilization and Pipeline Parallelism for Efficient Single-Pulse Detection
by: Xia, Bingzheng, et al.
Published: (2025)

Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training
by: Xu, Guanbin, et al.
Published: (2026)

RServe: Overlapping Encoding and Prefill for Efficient LMM Inference
by: Guo, Tianyu, et al.
Published: (2025)

Edge-Cloud Collaborative Pothole Detection via Onboard Event Screening and Federated Temporal Segmentation
by: Wu, Yingjie, et al.
Published: (2026)

Cross-region Model Training with Communication-Computation Overlapping and Delay Compensation
by: Zhu, Ying, et al.
Published: (2025)

Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing
by: Wang, Yanbo, et al.
Published: (2026)

Communication-Efficient Sparsely-Activated Model Training via Sequence Migration and Token Condensation
by: Chen, Fahao, et al.
Published: (2024)

AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training
by: Chen, Ling, et al.
Published: (2026)

KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
by: Jiang, Chaoyi, et al.
Published: (2024)

A Predictive and Synergistic Two-Layer Scheduling Framework for LLM Serving
by: Zhang, Yue, et al.
Published: (2025)

JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials
by: Wang, Hongyu, et al.
Published: (2026)

LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
by: Gu, Diandian, et al.
Published: (2024)

Oases: Efficient Large-Scale Model Training on Commodity Servers via Overlapped and Automated Tensor Model Parallelism
by: Li, Shengwei, et al.
Published: (2023)

Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
by: Zheng, Size, et al.
Published: (2025)

CFP: Efficient Optimization of Intra-Operator Parallelism Plans for Large Model Training
by: Hu, Weifang, et al.
Published: (2025)

InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding
by: Chen, Qiaoling, et al.
Published: (2024)

WWW.Serve: Interconnecting Global LLM Services through Decentralization
by: Wang, Huanyu, et al.
Published: (2026)

OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
by: Wang, Liujianfu, et al.
Published: (2025)

Towards Affordable, Adaptive and Automatic GNN Training on CPU-GPU Heterogeneous Platforms
by: Qiao, Tong, et al.
Published: (2025)

Jiagu: Optimizing Serverless Computing Resource Utilization with Harmonized Efficiency and Practicability
by: Liu, Qingyuan, et al.
Published: (2024)

NeutronTP: Load-Balanced Distributed Full-Graph GNN Training with Tensor Parallelism
by: Ai, Xin, et al.
Published: (2024)

HGraphScale: Hierarchical Graph Learning for Autoscaling Microservice Applications in Container-based Cloud Computing
by: Fang, Zhengxin, et al.
Published: (2025)

Chameleon: Taming Dynamic Operator Sequences for Memory-Intensive LLM Training
by: Wang, Zibo, et al.
Published: (2025)

An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters
by: Zhang, Mingjun, et al.
Published: (2025)

Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training
by: Papavasileiou, Ioannis, et al.
Published: (2026)

CO2: Efficient Distributed Training with Full Communication-Computation Overlap
by: Sun, Weigao, et al.
Published: (2024)

Efficient Distributed MLLM Training with Cornstarch
by: Jang, Insu, et al.
Published: (2025)

Boosting LLM Serving through Spatial-Temporal GPU Resource Sharing
by: Lin, Zejia, et al.
Published: (2025)

BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
by: Wu, Houming, et al.
Published: (2024)

HarmonyBatch: Batching multi-SLO DNN Inference with Heterogeneous Serverless Functions
by: Chen, Jiabin, et al.
Published: (2024)

FedOBD: Opportunistic Block Dropout for Efficiently Training Large-scale Neural Networks through Federated Learning
by: Chen, Yuanyuan, et al.
Published: (2022)

Boosting Scientific Error-Bounded Lossy Compression through Optimized Synergistic Lossy-Lossless Orchestration
by: Wu, Shixun, et al.
Published: (2025)

CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training
by: Chen, Tiancheng, et al.
Published: (2025)

Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models
by: Guo, Runsheng Benson, et al.
Published: (2024)

Next-Gen Computing Systems with Compute Express Link: a Comprehensive Survey
by: Chen, Chen, et al.
Published: (2024)

DawnPiper: A Memory-scablable Pipeline Parallel Training Framework
by: Peng, Xuan, et al.
Published: (2025)

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training
by: Hu, Tianhao, et al.
Published: (2026)