| Main Authors: | Chen, Ping, Zhang, Wenjie, He, Shuibing, Chen, Weijian, Yang, Siling, Huang, Kexin, Yin, Yanlong, Zhan, Xuan, Gu, Yingjie, Peng, Zhuwei, Zheng, Yi, Wang, Zhefeng, Chen, Gang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2406.08756 |
Similar Items
Adacc: An Adaptive Framework Unifying Compression and Activation Recomputation for LLM Training
by: Chen, Ping, et al.
Published: (2025)
HopGNN: Boosting Distributed GNN Training Efficiency via Feature-Centric Model Migration
by: Chen, Weijian, et al.
Published: (2024)
Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters
by: Chang, Zihan, et al.
Published: (2024)
Heimdall++: Optimizing GPU Utilization and Pipeline Parallelism for Efficient Single-Pulse Detection
by: Xia, Bingzheng, et al.
Published: (2025)
Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training
by: Xu, Guanbin, et al.
Published: (2026)
RServe: Overlapping Encoding and Prefill for Efficient LMM Inference
by: Guo, Tianyu, et al.
Published: (2025)
Edge-Cloud Collaborative Pothole Detection via Onboard Event Screening and Federated Temporal Segmentation
by: Wu, Yingjie, et al.
Published: (2026)
Cross-region Model Training with Communication-Computation Overlapping and Delay Compensation
by: Zhu, Ying, et al.
Published: (2025)
Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing
by: Wang, Yanbo, et al.
Published: (2026)
Communication-Efficient Sparsely-Activated Model Training via Sequence Migration and Token Condensation
by: Chen, Fahao, et al.
Published: (2024)
AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training
by: Chen, Ling, et al.
Published: (2026)
KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
by: Jiang, Chaoyi, et al.
Published: (2024)
A Predictive and Synergistic Two-Layer Scheduling Framework for LLM Serving
by: Zhang, Yue, et al.
Published: (2025)
JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials
by: Wang, Hongyu, et al.
Published: (2026)
LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
by: Gu, Diandian, et al.
Published: (2024)
Oases: Efficient Large-Scale Model Training on Commodity Servers via Overlapped and Automated Tensor Model Parallelism
by: Li, Shengwei, et al.
Published: (2023)
Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
by: Zheng, Size, et al.
Published: (2025)
CFP: Efficient Optimization of Intra-Operator Parallelism Plans for Large Model Training
by: Hu, Weifang, et al.
Published: (2025)
InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding
by: Chen, Qiaoling, et al.
Published: (2024)
WWW.Serve: Interconnecting Global LLM Services through Decentralization
by: Wang, Huanyu, et al.
Published: (2026)
OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
by: Wang, Liujianfu, et al.
Published: (2025)
Towards Affordable, Adaptive and Automatic GNN Training on CPU-GPU Heterogeneous Platforms
by: Qiao, Tong, et al.
Published: (2025)
Jiagu: Optimizing Serverless Computing Resource Utilization with Harmonized Efficiency and Practicability
by: Liu, Qingyuan, et al.
Published: (2024)
NeutronTP: Load-Balanced Distributed Full-Graph GNN Training with Tensor Parallelism
by: Ai, Xin, et al.
Published: (2024)
HGraphScale: Hierarchical Graph Learning for Autoscaling Microservice Applications in Container-based Cloud Computing
by: Fang, Zhengxin, et al.
Published: (2025)
Chameleon: Taming Dynamic Operator Sequences for Memory-Intensive LLM Training
by: Wang, Zibo, et al.
Published: (2025)
An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters
by: Zhang, Mingjun, et al.
Published: (2025)
Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training
by: Papavasileiou, Ioannis, et al.
Published: (2026)
CO2: Efficient Distributed Training with Full Communication-Computation Overlap
by: Sun, Weigao, et al.
Published: (2024)
Efficient Distributed MLLM Training with Cornstarch
by: Jang, Insu, et al.
Published: (2025)
Boosting LLM Serving through Spatial-Temporal GPU Resource Sharing
by: Lin, Zejia, et al.
Published: (2025)
BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
by: Wu, Houming, et al.
Published: (2024)
HarmonyBatch: Batching multi-SLO DNN Inference with Heterogeneous Serverless Functions
by: Chen, Jiabin, et al.
Published: (2024)
FedOBD: Opportunistic Block Dropout for Efficiently Training Large-scale Neural Networks through Federated Learning
by: Chen, Yuanyuan, et al.
Published: (2022)
Boosting Scientific Error-Bounded Lossy Compression through Optimized Synergistic Lossy-Lossless Orchestration
by: Wu, Shixun, et al.
Published: (2025)
CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training
by: Chen, Tiancheng, et al.
Published: (2025)
Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models
by: Guo, Runsheng Benson, et al.
Published: (2024)
Next-Gen Computing Systems with Compute Express Link: a Comprehensive Survey
by: Chen, Chen, et al.
Published: (2024)
DawnPiper: A Memory-scablable Pipeline Parallel Training Framework
by: Peng, Xuan, et al.
Published: (2025)
DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training
by: Hu, Tianhao, et al.
Published: (2026)