:: Library Catalog

Bibliographic Details
Main Author:	Lyu, Yi
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.01607

Similar Items

ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
by: Qiu, Haoran, et al.
Published: (2025)

HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models
by: Xu, Si, et al.
Published: (2024)

Optimizing Data Distribution and Kernel Performance for Efficient Training of Chemistry Foundation Models: A Case Study with MACE
by: Firoz, Jesun, et al.
Published: (2025)

Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training
by: Tan, Wenting, et al.
Published: (2023)

EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models
by: Cheng, Jialiang, et al.
Published: (2024)

Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
by: Zhu, Zhanda, et al.
Published: (2025)

FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized Models
by: Yi, Kai, et al.
Published: (2024)

Byzantine-Robust and Communication-Efficient Distributed Training: Compressive and Cyclic Gradient Coding
by: Li, Chengxi, et al.
Published: (2026)

ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks
by: Shi, Ziji, et al.
Published: (2024)

TrainVerify: Equivalence-Based Verification for Distributed LLM Training
by: Lu, Yunchi, et al.
Published: (2025)

MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services
by: Yu, Dianhai, et al.
Published: (2022)

TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
by: Liu, Man, et al.
Published: (2026)

PacTrain: Pruning and Adaptive Sparse Gradient Compression for Efficient Collective Communication in Distributed Deep Learning
by: Wang, Yisu, et al.
Published: (2025)

Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems
by: Lu, Ning, et al.
Published: (2024)

Demystifying the Communication Characteristics for Distributed Transformer Models
by: Anthony, Quentin, et al.
Published: (2024)

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
by: Ma, Qianli, et al.
Published: (2025)

Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
by: Xu, Lang, et al.
Published: (2025)

Accelerating Large Language Model Training with Hybrid GPU-based Compression
by: Xu, Lang, et al.
Published: (2024)

Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training
by: Liang, Mingyu, et al.
Published: (2025)

DIP: Efficient Large Multimodal Model Training with Dynamic Interleaved Pipeline
by: Xue, Zhenliang, et al.
Published: (2025)

Why Should the Server Do It All?: A Scalable, Versatile, and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression
by: Sung, Mingyu, et al.
Published: (2025)

Dora: QoE-Aware Hybrid Parallelism for Distributed Edge AI
by: Jin, Jianli, et al.
Published: (2025)

Towards an Introspective Dynamic Model of Globally Distributed Computing Infrastructures
by: Kilic, Ozgur O., et al.
Published: (2025)

TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
by: Han, Shujie, et al.
Published: (2026)

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
by: Zhao, Juntao, et al.
Published: (2025)

Galvatron: An Automatic Distributed System for Efficient Foundation Model Training
by: Liu, Xinyi, et al.
Published: (2025)

Training Through Failure: Effects of Data Consistency in Parallel Machine Learning Training
by: Cao, Ray, et al.
Published: (2024)

Failure-Resilient Distributed Inference with Model Compression over Heterogeneous Edge Devices
by: Wang, Li, et al.
Published: (2024)

Verify Distributed Deep Learning Model Implementation Refinement with Iterative Relation Inference
by: Wang, Zhanghan, et al.
Published: (2025)

FairBatching: Fairness-Aware Batch Formation for LLM Inference
by: Lyu, Hongtao, et al.
Published: (2025)

Multi-Agentic AI for Fairness-Aware and Accelerated Multi-modal Large Model Inference in Real-world Mobile Edge Networks
by: Li, Haiyuan, et al.
Published: (2026)

IoT-MCP: Bridging LLMs and IoT Systems Through Model Context Protocol
by: Yang, Ningyuan, et al.
Published: (2025)

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
by: Gu, Yida, et al.
Published: (2026)

OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
by: Zheng, Yijie, et al.
Published: (2025)

ParaCodex: A Profiling-Guided Autonomous Coding Agent for Reliable Parallel Code Generation and Translation
by: Kaplan, Erel, et al.
Published: (2026)

Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs
by: Xin, Jihao, et al.
Published: (2026)

UNIFERENCE: A Discrete Event Simulation Framework for Developing Distributed AI Models
by: Eldenk, Doğaç, et al.
Published: (2026)

Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
by: Yao, Jinghan, et al.
Published: (2024)

Revisiting Parameter Server in LLM Post-Training
by: Wan, Xinyi, et al.
Published: (2026)

Lightweight Trustworthy Distributed Clustering
by: Li, Hongyang, et al.
Published: (2025)