| Main Authors: | Xu, Chao, Zhang, Xu, Luo, Zihang, Wu, Yuyan, Qian, Guoxin, Yao, Yufeng, Wang, Chihyung, Zhou, Jingbin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.22428 |
Similar Items
gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
by: Huang, Jiajun, et al.
Published: (2023)
FUSCO: High-Performance Distributed Data Shuffling via Transformation-Communication Fusion
by: Zhu, Zhuoran, et al.
Published: (2025)
Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
by: Zhang, Xinyi, et al.
Published: (2024)
Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control
by: Wang, Zhigang, et al.
Published: (2024)
An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters
by: Zhang, Mingjun, et al.
Published: (2025)
Air-FedGA: A Grouping Asynchronous Federated Learning Mechanism Exploiting Over-the-air Computation
by: Ma, Qianpiao, et al.
Published: (2025)
EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet
by: Yuan, Yitao, et al.
Published: (2026)
Cross-region Model Training with Communication-Computation Overlapping and Delay Compensation
by: Zhu, Ying, et al.
Published: (2025)
Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement
by: Wu, Tian, et al.
Published: (2025)
Distributed Consensus Network: A Modularized Communication Framework and Reliability Probabilistic Analysis
by: Li, Yuetai, et al.
Published: (2025)
On-the-fly Communication-and-Computing to Enable Representation Learning for Distributed Point Clouds
by: Chen, Xu, et al.
Published: (2024)
ZCCL: Significantly Improving Collective Communication With Error-Bounded Lossy Compression
by: Huang, Jiajun, et al.
Published: (2025)
Accelerating Compound LLM Training Workloads with Maestro
by: Yuan, Xiulong, et al.
Published: (2026)
Accelerating Distributed MoE Training and Inference with Lina
by: Li, Jiamin, et al.
Published: (2022)
Generic Multicast (Extended Version)
by: Bolina, José Augusto, et al.
Published: (2024)
AES-SpMM: Balancing Accuracy and Speed by Adaptive Edge Sampling Strategy to Accelerate SpMM in GNNs
by: Song, Yingchen, et al.
Published: (2025)
GPU-Accelerated Distributed QAOA on Large-scale HPC Ecosystems
by: Xu, Zhihao, et al.
Published: (2025)
Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training
by: Deng, Yangtao, et al.
Published: (2025)
Prime Collective Communications Library -- Technical Report
by: Keiblinger, Michael, et al.
Published: (2025)
ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training
by: Lin, Wenxiang, et al.
Published: (2026)
Hyperion: Hierarchical Scheduling for Parallel LLM Acceleration in Multi-tier Networks
by: Ma, Mulei, et al.
Published: (2025)
Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
by: Hu, Tianlun, et al.
Published: (2026)
Accelerating End-Cloud Collaborative Inference via Near Bubble-free Pipeline Optimization
by: Gao, Luyao, et al.
Published: (2024)
HiCCL: A Hierarchical Collective Communication Library
by: Hidayetoglu, Mert, et al.
Published: (2024)
Union: An Automatic Workload Manager for Accelerating Network Simulation
by: Wang, Xin, et al.
Published: (2024)
Collaborative Inference Acceleration with Non-Penetrative Tensor Partitioning
by: Liu, Zhibang, et al.
Published: (2025)
A HPX Communication Benchmark: Distributed FFT using Collectives
by: Strack, Alexander, et al.
Published: (2025)
Extracting the Potential of Emerging Hardware Accelerators for Symmetric Eigenvalue Decomposition
by: Wang, Hansheng, et al.
Published: (2024)
Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
by: Wang, Zhibin, et al.
Published: (2025)
Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs
by: Chen, Aodong, et al.
Published: (2023)
Communication-Efficient Distributed Learning with Local Immediate Error Compensation
by: Cheng, Yifei, et al.
Published: (2024)
exa-AMD: A Scalable Workflow for Accelerating AI-Assisted Materials Discovery and Design
by: Moraru, Maxim, et al.
Published: (2025)
LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind
by: Zhang, Li, et al.
Published: (2025)
A Portable Framework for Accelerating Stencil Computations on Modern Node Architectures
by: Sai, Ryuichi, et al.
Published: (2023)
Accelerating OpenPangu Inference on NPU via Speculative Decoding
by: Dai, Yuntao, et al.
Published: (2026)
RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching
by: Zhao, Zhan, et al.
Published: (2026)
DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training
by: Tan, Xin, et al.
Published: (2025)
FlowMesh: A Service Fabric for Composable LLM Workflows
by: Shen, Junyi, et al.
Published: (2025)
PALM: A Efficient Performance Simulator for Tiled Accelerators with Large-scale Model Training
by: Fang, Jiahao, et al.
Published: (2024)
Exploiting Stragglers in Distributed Computing Systems with Task Grouping
by: Adikari, Tharindu, et al.
Published: (2024)