| Main Author: | Lyu, Yi |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.01607 |
Similar Items
ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
by: Qiu, Haoran, et al.
Published: (2025)
HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models
by: Xu, Si, et al.
Published: (2024)
Optimizing Data Distribution and Kernel Performance for Efficient Training of Chemistry Foundation Models: A Case Study with MACE
by: Firoz, Jesun, et al.
Published: (2025)
Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training
by: Tan, Wenting, et al.
Published: (2023)
EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models
by: Cheng, Jialiang, et al.
Published: (2024)
Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
by: Zhu, Zhanda, et al.
Published: (2025)
FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized Models
by: Yi, Kai, et al.
Published: (2024)
Byzantine-Robust and Communication-Efficient Distributed Training: Compressive and Cyclic Gradient Coding
by: Li, Chengxi, et al.
Published: (2026)
ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks
by: Shi, Ziji, et al.
Published: (2024)
TrainVerify: Equivalence-Based Verification for Distributed LLM Training
by: Lu, Yunchi, et al.
Published: (2025)
MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services
by: Yu, Dianhai, et al.
Published: (2022)
TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
by: Liu, Man, et al.
Published: (2026)
PacTrain: Pruning and Adaptive Sparse Gradient Compression for Efficient Collective Communication in Distributed Deep Learning
by: Wang, Yisu, et al.
Published: (2025)
Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems
by: Lu, Ning, et al.
Published: (2024)
Demystifying the Communication Characteristics for Distributed Transformer Models
by: Anthony, Quentin, et al.
Published: (2024)
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
by: Ma, Qianli, et al.
Published: (2025)
Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
by: Xu, Lang, et al.
Published: (2025)
Accelerating Large Language Model Training with Hybrid GPU-based Compression
by: Xu, Lang, et al.
Published: (2024)
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training
by: Liang, Mingyu, et al.
Published: (2025)
DIP: Efficient Large Multimodal Model Training with Dynamic Interleaved Pipeline
by: Xue, Zhenliang, et al.
Published: (2025)
Why Should the Server Do It All?: A Scalable, Versatile, and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression
by: Sung, Mingyu, et al.
Published: (2025)
Dora: QoE-Aware Hybrid Parallelism for Distributed Edge AI
by: Jin, Jianli, et al.
Published: (2025)
Towards an Introspective Dynamic Model of Globally Distributed Computing Infrastructures
by: Kilic, Ozgur O., et al.
Published: (2025)
TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
by: Han, Shujie, et al.
Published: (2026)
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
by: Zhao, Juntao, et al.
Published: (2025)
Galvatron: An Automatic Distributed System for Efficient Foundation Model Training
by: Liu, Xinyi, et al.
Published: (2025)
Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training
by: Cao, Ray, et al.
Published: (2024)
Failure-Resilient Distributed Inference with Model Compression over Heterogeneous Edge Devices
by: Wang, Li, et al.
Published: (2024)
Verify Distributed Deep Learning Model Implementation Refinement with Iterative Relation Inference
by: Wang, Zhanghan, et al.
Published: (2025)
FairBatching: Fairness-Aware Batch Formation for LLM Inference
by: Lyu, Hongtao, et al.
Published: (2025)
Multi-Agentic AI for Fairness-Aware and Accelerated Multi-modal Large Model Inference in Real-world Mobile Edge Networks
by: Li, Haiyuan, et al.
Published: (2026)
IoT-MCP: Bridging LLMs and IoT Systems Through Model Context Protocol
by: Yang, Ningyuan, et al.
Published: (2025)
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
by: Gu, Yida, et al.
Published: (2026)
OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
by: Zheng, Yijie, et al.
Published: (2025)
ParaCodex: A Profiling-Guided Autonomous Coding Agent for Reliable Parallel Code Generation and Translation
by: Kaplan, Erel, et al.
Published: (2026)
Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs
by: Xin, Jihao, et al.
Published: (2026)
UNIFERENCE: A Discrete Event Simulation Framework for Developing Distributed AI Models
by: Eldenk, Doğaç, et al.
Published: (2026)
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
by: Yao, Jinghan, et al.
Published: (2024)
Revisiting Parameter Server in LLM Post-Training
by: Wan, Xinyi, et al.
Published: (2026)
Lightweight Trustworthy Distributed Clustering
by: Li, Hongyang, et al.
Published: (2025)