:: Library Catalog

Bibliographic Details
Main Author:	Wangni, Jianqiao
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2504.07513

Similar Items

Galvatron: An Automatic Distributed System for Efficient Foundation Model Training
by: Liu, Xinyi, et al.
Published: (2025)

Laminar: A Scalable Asynchronous RL Post-Training Framework
by: Sheng, Guangming, et al.
Published: (2025)

Scalable and Adaptive Parallel Training of Graph Transformer on Large Graphs
by: Lin, Jun-Liang, et al.
Published: (2026)

Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers
by: Singh, Siddharth, et al.
Published: (2025)

Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training
by: Liu, Guanliang, et al.
Published: (2026)

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism
by: Jeon, Byungsoo, et al.
Published: (2024)

Communication-free Sampling and 4D Hybrid Parallelism for Scalable Mini-batch GNN Training
by: Wei, Cunyang, et al.
Published: (2026)

Enhancing Data Quality in Federated Fine-Tuning of Foundation Models
by: Zhao, Wanru, et al.
Published: (2024)

A Survey of Resource-efficient LLM and Multimodal Foundation Models
by: Xu, Mengwei, et al.
Published: (2024)

Fine-Tuning GPT-5 for GPU Kernel Generation
by: Tehrani, Ali, et al.
Published: (2026)

Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer
by: Vooturi, Dharma Teja, et al.
Published: (2026)

GPU Memory Prediction for Multimodal Model Training
by: Jeong, Jinwoo, et al.
Published: (2025)

DP2FL: Dual Prompt Personalized Federated Learning in Foundation Models
by: Chang, Ying, et al.
Published: (2025)

When Foundation Model Meets Federated Learning: Motivations, Challenges, and Future Directions
by: Zhuang, Weiming, et al.
Published: (2023)

MQ-GNN: A Multi-Queue Pipelined Architecture for Scalable and Efficient GNN Training
by: Ullah, Irfan, et al.
Published: (2026)

Efficient and Scalable Agentic AI with Heterogeneous Systems
by: Asgar, Zain, et al.
Published: (2025)

Context Parallelism for Scalable Million-Token Inference
by: Yang, Amy, et al.
Published: (2024)

Scalable Artificial Intelligence for Science: Perspectives, Methods and Exemplars
by: Brewer, Wesley, et al.
Published: (2024)

TrainVerify: Equivalence-Based Verification for Distributed LLM Training
by: Lu, Yunchi, et al.
Published: (2025)

PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
by: Wan, Xinyi, et al.
Published: (2025)

Hubs and Spokes Learning: Efficient and Scalable Collaborative Machine Learning
by: Sharma, Atul, et al.
Published: (2025)

The Big Send-off: Scalable and Performant Collectives for Deep Learning
by: Singh, Siddharth, et al.
Published: (2025)

DistShap: Scalable GNN Explanations with Distributed Shapley Values
by: Akkas, Selahattin, et al.
Published: (2025)

DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling
by: Gao, Yubo, et al.
Published: (2025)

BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
by: Wu, Houming, et al.
Published: (2024)

FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized Models
by: Yi, Kai, et al.
Published: (2024)

Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
by: Yao, Jinghan, et al.
Published: (2024)

EASTER: Embedding Aggregation-based Heterogeneous Models Training in Vertical Federated Learning
by: Wang, Shuo, et al.
Published: (2023)

Training Heterogeneous Client Models using Knowledge Distillation in Serverless Federated Learning
by: Chadha, Mohak, et al.
Published: (2024)

Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training
by: Brewer, Wesley, et al.
Published: (2025)

FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication
by: Oakley, Joe, et al.
Published: (2024)

AccidentGPT: Large Multi-Modal Foundation Model for Traffic Accident Analysis
by: Wu, Kebin, et al.
Published: (2024)

TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training
by: Wu, Houming, et al.
Published: (2025)

Federated Learning with Workload Reduction through Partial Training of Client Models and Entropy-Based Data Selection
by: Shi, Hongrui, et al.
Published: (2024)

Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
by: Dash, Sajal, et al.
Published: (2026)

Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure
by: He, Jun, et al.
Published: (2026)

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
by: Chen, Yanxi, et al.
Published: (2023)

FedPBS: Proximal-Balanced Scaling Federated Learning Model for Robust Personalized Training for Non-IID Data
by: AbouNassar, Eman M., et al.
Published: (2026)

SFPrompt: Communication-Efficient Split Federated Fine-Tuning for Large Pre-Trained Models over Resource-Limited Devices
by: Cao, Linxiao, et al.
Published: (2024)

Robust LLM Training Infrastructure at ByteDance
by: Wan, Borui, et al.
Published: (2025)