:: Library Catalog

Bibliographic Details
Main Authors:	Ghadia, Ravi; Abraham, Maksim; Vorobyov, Sergei; Ryabinin, Max
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2602.21196

Similar Items

AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference
by: Zhao, Xuanlei, et al.
Published: (2024)

DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism
by: Jiang, Chenyu, et al.
Published: (2025)

ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference
by: Meng, Han, et al.
Published: (2026)

LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
by: Wu, Bingyang, et al.
Published: (2024)

Improving Automatic Parallel Training via Balanced Memory Workload Optimization
by: Wang, Yujie, et al.
Published: (2023)

Memory and Bandwidth are All You Need for Fully Sharded Data Parallel
by: Wang, Jiangtao, et al.
Published: (2025)

Pipeline Parallelism with Controllable Memory
by: Qi, Penghui, et al.
Published: (2024)

HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
by: Zhang, Geng, et al.
Published: (2025)

Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design
by: Xue, Chunyu, et al.
Published: (2024)

DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction
by: Zhang, Yanqi, et al.
Published: (2024)

Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
by: Fujii, Kazuki, et al.
Published: (2024)

PaSE: Parallelization Strategies for Efficient DNN Training
by: Elango, Venmugil
Published: (2024)

DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism
by: Niu, Yifan, et al.
Published: (2026)

Efficient Parallelization Layouts for Large-Scale Distributed Model Training
by: Hagemann, Johannes, et al.
Published: (2023)

Efficient Parallel Reinforcement Learning Framework using the Reactor Model
by: Kwok, Jacky, et al.
Published: (2023)

Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
by: Prabhakar, Rohan Baskar, et al.
Published: (2024)

Efficient Long Context Fine-tuning with Chunk Flow
by: Yuan, Xiulong, et al.
Published: (2025)

Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification
by: Huang, Guang, et al.
Published: (2026)

MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
by: Liu, Dennis, et al.
Published: (2025)

ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
by: Liu, Zedong, et al.
Published: (2025)

SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning
by: Su, Jianchang, et al.
Published: (2026)

Scalable and Cost-Efficient ML Inference: Parallel Batch Processing with Serverless Functions
by: Barrak, Amine, et al.
Published: (2025)

CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism
by: Ma, Bin, et al.
Published: (2026)

Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe
by: Huang, Mincong, et al.
Published: (2024)

PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
by: Bai, Xu, et al.
Published: (2026)

Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
by: Rajbhandari, Samyam, et al.
Published: (2025)

Vanishing Variance Problem in Fully Decentralized Neural-Network Systems
by: Tian, Yongding, et al.
Published: (2024)

FedRDMA: Communication-Efficient Cross-Silo Federated LLM via Chunked RDMA Transmission
by: Zhang, Zeling, et al.
Published: (2024)

Context Parallelism for Scalable Million-Token Inference
by: Yang, Amy, et al.
Published: (2024)

DHO$_2$: Accelerating Distributed Hybrid Order Optimization via Model Parallelism and ADMM
by: Gu, Shunxian, et al.
Published: (2025)

FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
by: Wang, Yujie, et al.
Published: (2024)

GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism
by: Polisetty, Sandeep, et al.
Published: (2023)

On Optimizing the Communication of Model Parallelism
by: Zhuang, Yonghao, et al.
Published: (2022)

Armada: Memory-Efficient Distributed Training of Large-Scale Graph Neural Networks
by: Waleffe, Roger, et al.
Published: (2025)

AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
by: Gupta, Ahan, et al.
Published: (2026)

Edge-Parallel Graph Encoder Embedding
by: Lubonja, Ariel, et al.
Published: (2024)

TASP: Topology-aware Sequence Parallelism
by: Wang, Yida, et al.
Published: (2025)

Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment
by: Li, Haoyang, et al.
Published: (2024)

PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
by: Wan, Xinyi, et al.
Published: (2025)

Breaking the Memory Wall for Heterogeneous Federated Learning via Progressive Training
by: Wu, Yebo, et al.
Published: (2024)