| Main Authors: | Fujii, Kazuki, Watanabe, Kohei, Yokota, Rio |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2411.06465 |
Similar Items
FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
by: Wang, Yujie, et al.
Published: (2024)
Heterogeneous Parallelism for Multimodal Large Language Model Training
by: Karnati, Yashaswi, et al.
Published: (2026)
DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction
by: Zhang, Yanqi, et al.
Published: (2024)
Efficient Parallelization Layouts for Large-Scale Distributed Model Training
by: Hagemann, Johannes, et al.
Published: (2023)
BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
by: Wu, Houming, et al.
Published: (2024)
AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training
by: Chen, Ling, et al.
Published: (2026)
Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training
by: Zhang, Xin, et al.
Published: (2025)
MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
by: Liu, Dennis, et al.
Published: (2025)
Lion Cub: Minimizing Communication Overhead in Distributed Lion
by: Ishikawa, Satoki, et al.
Published: (2024)
FedPM: Federated Learning Using Second-order Optimization with Preconditioned Mixing of Local Parameters
by: Ishii, Hiro, et al.
Published: (2025)
Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference
by: Luo, Shuqing, et al.
Published: (2025)
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
by: Chen, Yanxi, et al.
Published: (2023)
Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design
by: Xue, Chunyu, et al.
Published: (2024)
Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment
by: Li, Haoyang, et al.
Published: (2024)
Improving Automatic Parallel Training via Balanced Memory Workload Optimization
by: Wang, Yujie, et al.
Published: (2023)
TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training
by: Wu, Houming, et al.
Published: (2025)
GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism
by: Polisetty, Sandeep, et al.
Published: (2023)
LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
by: Wu, Bingyang, et al.
Published: (2024)
ParaBlock: Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models
by: Wang, Yujia, et al.
Published: (2025)
TurboGR: An Accelerated Training System for Large-Scale Generative Recommendation
by: Chai, Huichao, et al.
Published: (2026)
DHO$_2$: Accelerating Distributed Hybrid Order Optimization via Model Parallelism and ADMM
by: Gu, Shunxian, et al.
Published: (2025)
Enabling Large Batch Size Training for DNN Models Beyond the Memory Limit While Maintaining Performance
by: Piao, XinYu, et al.
Published: (2021)
SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training
by: Jia, Jinda, et al.
Published: (2024)
DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training
by: Wang, Yuanqing, et al.
Published: (2026)
Optimization of Energy Consumption Forecasting in Puno using Parallel Computing and ARIMA Models: An Innovative Approach to Big Data Processing
by: Vilca-Tinta, Cliver W., et al.
Published: (2024)
Armada: Memory-Efficient Distributed Training of Large-Scale Graph Neural Networks
by: Waleffe, Roger, et al.
Published: (2025)
Go With The Flow: Churn-Tolerant Decentralized Training of Large Language Models
by: Blagoev, Nikolay, et al.
Published: (2025)
HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
by: Zhang, Geng, et al.
Published: (2025)
PaSE: Parallelization Strategies for Efficient DNN Training
by: Elango, Venmugil
Published: (2024)
Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking
by: Ghadia, Ravi, et al.
Published: (2026)
Memory and Bandwidth are All You Need for Fully Sharded Data Parallel
by: Wang, Jiangtao, et al.
Published: (2025)
WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
by: Wang, Zheng, et al.
Published: (2025)
Pipeline Parallelism with Controllable Memory
by: Qi, Penghui, et al.
Published: (2024)
DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism
by: Niu, Yifan, et al.
Published: (2026)
NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining
by: Jiang, Zhida, et al.
Published: (2026)
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
by: Wang, Zhengyang, et al.
Published: (2025)
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
by: Jiang, Ziheng, et al.
Published: (2024)
Pipette: Automatic Fine-grained Large Language Model Training Configurator for Real-World Clusters
by: Yim, Jinkyu, et al.
Published: (2024)
Reducing Energy Bloat in Large Model Training
by: Chung, Jae-Won, et al.
Published: (2023)
On Optimizing the Communication of Model Parallelism
by: Zhuang, Yonghao, et al.
Published: (2022)