:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Li, Wanqian, Peng, Jintao, Jing, Zongfei, Zhang, Tianyu, Long, Ze, Qiao, Xianjie, Chen, Xiaoming, Yang, Dongxu, Duan, Kefeng, Yang, June
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.01621
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

RServe: Overlapping Encoding and Prefill for Efficient LMM Inference
by: Guo, Tianyu, et al.
Published: (2025)

gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
by: Guo, Tianyu, et al.
Published: (2025)

Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
by: Hidayetoglu, Mert, et al.
Published: (2025)

Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization
by: Chen, Jiu, et al.
Published: (2026)

AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism
by: Xu, Wendong, et al.
Published: (2025)

Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services
by: Chen, Haoyu, et al.
Published: (2025)

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
by: Zhang, Mingjin, et al.
Published: (2024)

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference
by: Zhao, Alan, et al.
Published: (2026)

OnePiece: A Large-Scale Distributed Inference System with RDMA for Complex AI-Generated Content (AIGC) Workflows
by: Chen, June, et al.
Published: (2026)

Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
by: Wei, Jinhui, et al.
Published: (2025)

Hyperion: Hierarchical Scheduling for Parallel LLM Acceleration in Multi-tier Networks
by: Ma, Mulei, et al.
Published: (2025)

Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
by: Rajbhandari, Samyam, et al.
Published: (2025)

TD-Pipe: Temporally-Disaggregated Pipeline Parallelism Architecture for High-Throughput LLM Inference
by: Zhang, Hongbin, et al.
Published: (2025)

APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs
by: Fan, Jiakun, et al.
Published: (2025)

ACE-GNN: Adaptive GNN Co-Inference with System-Aware Scheduling in Dynamic Edge Environments
by: Zhou, Ao, et al.
Published: (2025)

Argus: Token Aware Distributed LLM Inference Optimization
by: Wu, Panlong, et al.
Published: (2025)

Distributed On-Device LLM Inference With Over-the-Air Computation
by: Zhang, Kai, et al.
Published: (2025)

SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference
by: He, Yongchao, et al.
Published: (2025)

Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism
by: Li, Cong, et al.
Published: (2025)

The Complexity of Distributed Minimum Weight Cycle Approximation
by: Chang, Yi-Jun, et al.
Published: (2026)

MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference
by: Tang, Xinru, et al.
Published: (2025)

Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation
by: Cheng, Long, et al.
Published: (2026)

Fine-Grained Energy Prediction For Parallellized LLM Inference With PIE-P
by: Dutt, Anurag, et al.
Published: (2025)

ParallelSFL: A Novel Split Federated Learning Framework Tackling Heterogeneity Issues
by: Liao, Yunming, et al.
Published: (2024)

Bandwidth-Aware and Cost-Efficient Pipeline Parallel Scheduling in Geo-Distributed LLM Training
by: Zhang, Han, et al.
Published: (2026)

Federated Inference for Heterogeneous LLM Communication and Collaboration
by: Chen, Zihan, et al.
Published: (2026)

GPU-Based Parallel Computing Methods for Medical Photoacoustic Image Reconstruction
by: Yi, Xinyao, et al.
Published: (2024)

CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism
by: Ma, Bin, et al.
Published: (2026)

Hybrid-Parallel: Achieving High Performance and Energy Efficient Distributed Inference on Robots
by: Sun, Zekai, et al.
Published: (2024)

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
by: Sun, Xun, et al.
Published: (2026)

DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization
by: Tang, Zhenheng, et al.
Published: (2025)

cuFastTuckerPlus: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores
by: Li, Zixuan, et al.
Published: (2024)

SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines
by: Cheng, Ke, et al.
Published: (2024)

Learning to Shard: RL for Co-optimizing the Parallelism Degrees and Per-operator Sharding Dimensions in Distributed LLM Inference
by: Yin, Ruokai, et al.
Published: (2025)

HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference
by: Lin, Haoran, et al.
Published: (2025)

Staleness-Centric Optimizations for Parallel Diffusion MoE Inference
by: Luo, Jiajun, et al.
Published: (2024)

Enhancing Memory Efficiency in Large Language Model Training Through Chronos-aware Pipeline Parallelism
by: Lin, Xinyuan, et al.
Published: (2025)

Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture
by: Wu, Yu, et al.
Published: (2025)

Parallax: Efficient LLM Inference Service over Decentralized Environment
by: Tong, Chris, et al.
Published: (2025)

Bandwidth-Aware LLM Inference on Heterogeneous Many-Core Supercomputers
by: Lu, Yao, et al.
Published: (2026)