| Main Authors: | Li, Wanqian, Peng, Jintao, Jing, Zongfei, Zhang, Tianyu, Long, Ze, Qiao, Xianjie, Chen, Xiaoming, Yang, Dongxu, Duan, Kefeng, Yang, June |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.01621 |
Similar Items
RServe: Overlapping Encoding and Prefill for Efficient LMM Inference
by: Guo, Tianyu, et al.
Published: (2025)
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
by: Guo, Tianyu, et al.
Published: (2025)
Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
by: Hidayetoglu, Mert, et al.
Published: (2025)
Distributed Generative Inference of LLM at Internet Scales with Multi-Dimensional Communication Optimization
by: Chen, Jiu, et al.
Published: (2026)
AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism
by: Xu, Wendong, et al.
Published: (2025)
Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services
by: Chen, Haoyu, et al.
Published: (2025)
EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
by: Zhang, Mingjin, et al.
Published: (2024)
SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference
by: Zhao, Alan, et al.
Published: (2026)
OnePiece: A Large-Scale Distributed Inference System with RDMA for Complex AI-Generated Content (AIGC) Workflows
by: Chen, June, et al.
Published: (2026)
Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
by: Wei, Jinhui, et al.
Published: (2025)
Hyperion: Hierarchical Scheduling for Parallel LLM Acceleration in Multi-tier Networks
by: Ma, Mulei, et al.
Published: (2025)
Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
by: Rajbhandari, Samyam, et al.
Published: (2025)
TD-Pipe: Temporally-Disaggregated Pipeline Parallelism Architecture for High-Throughput LLM Inference
by: Zhang, Hongbin, et al.
Published: (2025)
APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs
by: Fan, Jiakun, et al.
Published: (2025)
ACE-GNN: Adaptive GNN Co-Inference with System-Aware Scheduling in Dynamic Edge Environments
by: Zhou, Ao, et al.
Published: (2025)
Argus: Token Aware Distributed LLM Inference Optimization
by: Wu, Panlong, et al.
Published: (2025)
Distributed On-Device LLM Inference With Over-the-Air Computation
by: Zhang, Kai, et al.
Published: (2025)
SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference
by: He, Yongchao, et al.
Published: (2025)
Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism
by: Li, Cong, et al.
Published: (2025)
The Complexity of Distributed Minimum Weight Cycle Approximation
by: Chang, Yi-Jun, et al.
Published: (2026)
MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference
by: Tang, Xinru, et al.
Published: (2025)
Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation
by: Cheng, Long, et al.
Published: (2026)
Fine-Grained Energy Prediction For Parallellized LLM Inference With PIE-P
by: Dutt, Anurag, et al.
Published: (2025)
ParallelSFL: A Novel Split Federated Learning Framework Tackling Heterogeneity Issues
by: Liao, Yunming, et al.
Published: (2024)
Bandwidth-Aware and Cost-Efficient Pipeline Parallel Scheduling in Geo-Distributed LLM Training
by: Zhang, Han, et al.
Published: (2026)
Federated Inference for Heterogeneous LLM Communication and Collaboration
by: Chen, Zihan, et al.
Published: (2026)
GPU-Based Parallel Computing Methods for Medical Photoacoustic Image Reconstruction
by: Yi, Xinyao, et al.
Published: (2024)
CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism
by: Ma, Bin, et al.
Published: (2026)
Hybrid-Parallel: Achieving High Performance and Energy Efficient Distributed Inference on Robots
by: Sun, Zekai, et al.
Published: (2024)
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
by: Sun, Xun, et al.
Published: (2026)
DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization
by: Tang, Zhenheng, et al.
Published: (2025)
cuFastTuckerPlus: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores
by: Li, Zixuan, et al.
Published: (2024)
SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines
by: Cheng, Ke, et al.
Published: (2024)
Learning to Shard: RL for Co-optimizing the Parallelism Degrees and Per-operator Sharding Dimensions in Distributed LLM Inference
by: Yin, Ruokai, et al.
Published: (2025)
HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference
by: Lin, Haoran, et al.
Published: (2025)
Staleness-Centric Optimizations for Parallel Diffusion MoE Inference
by: Luo, Jiajun, et al.
Published: (2024)
Enhancing Memory Efficiency in Large Language Model Training Through Chronos-aware Pipeline Parallelism
by: Lin, Xinyuan, et al.
Published: (2025)
Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture
by: Wu, Yu, et al.
Published: (2025)
Parallax: Efficient LLM Inference Service over Decentralized Environment
by: Tong, Chris, et al.
Published: (2025)
Bandwidth-Aware LLM Inference on Heterogeneous Many-Core Supercomputers
by: Lu, Yao, et al.
Published: (2026)