Library Catalog

Bibliographic Details
Main Authors:	Wang, Rongxiang; Shu, Kangyuan; Lin, Felix Xiaozhu
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2511.04477

Similar Items

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
by: Lin, Shouxu, et al.
Published: (2026)

VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
by: Liu, Zihan, et al.
Published: (2025)

MSAO: Adaptive Modality Sparsity-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference
by: Yang, Zheming, et al.
Published: (2026)

NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
by: Wang, Jiamin, et al.
Published: (2026)

Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching
by: Pang, Bowen, et al.
Published: (2025)

DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving
by: Liao, Junhan, et al.
Published: (2025)

S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance
by: Liu, Di, et al.
Published: (2026)

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
by: Hu, Cunchen, et al.
Published: (2024)

Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling
by: Razavi, Kamran, et al.
Published: (2024)

Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
by: Hidayetoglu, Mert, et al.
Published: (2025)

Argus: Token Aware Distributed LLM Inference Optimization
by: Wu, Panlong, et al.
Published: (2025)

Opt4GPTQ: Co-Optimizing Memory and Computation for 4-bit GPTQ Quantized LLM Inference on Heterogeneous Platforms
by: Zhang, Yaozheng, et al.
Published: (2025)

HybridFlow: Resource-Adaptive Subtask Routing for Efficient Edge-Cloud LLM Inference
by: Dong, Jiangwen, et al.
Published: (2025)

Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
by: Wang, Qipeng
Published: (2026)

λScale: Enabling Fast Scaling for Serverless Large Language Model Inference
by: Yu, Minchen, et al.
Published: (2025)

Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient Inference
by: Yu, Minchen, et al.
Published: (2023)

Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services
by: Chen, Haoyu, et al.
Published: (2025)

SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines
by: Cheng, Ke, et al.
Published: (2024)

LLM-CoOpt: A Co-Design and Optimization Framework for Efficient LLM Inference on Heterogeneous Platforms
by: Kong, Jie, et al.
Published: (2026)

CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands
by: Wang, Weiye, et al.
Published: (2026)

Cloud Native System for LLM Inference Serving
by: Xu, Minxian, et al.
Published: (2025)

Federated Inference for Heterogeneous LLM Communication and Collaboration
by: Chen, Zihan, et al.
Published: (2026)

ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
by: Ma, Tenghui, et al.
Published: (2026)

Communication-Efficient Collaborative LLM Inference over LEO Satellite Networks
by: Zhang, Songge, et al.
Published: (2026)

Chameleon: Taming Dynamic Operator Sequences for Memory-Intensive LLM Training
by: Wang, Zibo, et al.
Published: (2025)

Pie: Pooling CPU Memory for LLM Inference
by: Xu, Yi, et al.
Published: (2024)

Sparsity-Aware Roofline Models for Sparse Matrix-Matrix Multiplication
by: Qian, Matthew, et al.
Published: (2026)

Power Aware Dynamic Reallocation For Inference
by: Jiang, Yiwei, et al.
Published: (2026)

Understanding the Performance and Power of LLM Inferencing on Edge Accelerators
by: Arya, Mayank, et al.
Published: (2025)

From Attention to Disaggregation: Tracing the Evolution of LLM Inference
by: Kumar, Madabattula Rajesh, et al.
Published: (2025)

Towards Resource-Efficient Serverless LLM Inference with SLINFER
by: Xu, Chuhao, et al.
Published: (2025)

Toward Sustainability-Aware LLM Inference on Edge Clusters
by: Rajashekar, Kolichala, et al.
Published: (2025)

Distributed On-Device LLM Inference With Over-the-Air Computation
by: Zhang, Kai, et al.
Published: (2025)

Efficient LLM Inference with Activation Checkpointing and Hybrid Caching
by: Lee, Sanghyeon, et al.
Published: (2025)

WANSpec: Leveraging Global Compute Capacity for LLM Inference
by: Martin, Noah, et al.
Published: (2026)

SYMPHONY: Improving Memory Management for LLM Inference Workloads
by: Agarwal, Saurabh, et al.
Published: (2024)

AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality
by: Bournias, Ilias, et al.
Published: (2024)

LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
by: Da, Wei, et al.
Published: (2026)

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
by: Lin, Mao, et al.
Published: (2026)

A Pipelined Collaborative Speculative Decoding Framework for Efficient Edge-Cloud LLM Inference
by: Zhang, Yida, et al.
Published: (2026)