| Main Authors: | Wang, Rongxiang, Shu, Kangyuan, Lin, Felix Xiaozhu |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.04477 |
Similar Items
DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
by: Lin, Shouxu, et al.
Published: (2026)
VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
by: Liu, Zihan, et al.
Published: (2025)
MSAO: Adaptive Modality Sparsity-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference
by: Yang, Zheming, et al.
Published: (2026)
NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
by: Wang, Jiamin, et al.
Published: (2026)
Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching
by: Pang, Bowen, et al.
Published: (2025)
DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving
by: Liao, Junhan, et al.
Published: (2025)
S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance
by: Liu, Di, et al.
Published: (2026)
Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
by: Hu, Cunchen, et al.
Published: (2024)
Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling
by: Razavi, Kamran, et al.
Published: (2024)
Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
by: Hidayetoglu, Mert, et al.
Published: (2025)
Argus: Token Aware Distributed LLM Inference Optimization
by: Wu, Panlong, et al.
Published: (2025)
Opt4GPTQ: Co-Optimizing Memory and Computation for 4-bit GPTQ Quantized LLM Inference on Heterogeneous Platforms
by: Zhang, Yaozheng, et al.
Published: (2025)
HybridFlow: Resource-Adaptive Subtask Routing for Efficient Edge-Cloud LLM Inference
by: Dong, Jiangwen, et al.
Published: (2025)
Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
by: Wang, Qipeng
Published: (2026)
λScale: Enabling Fast Scaling for Serverless Large Language Model Inference
by: Yu, Minchen, et al.
Published: (2025)
Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient Inference
by: Yu, Minchen, et al.
Published: (2023)
Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services
by: Chen, Haoyu, et al.
Published: (2025)
SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines
by: Cheng, Ke, et al.
Published: (2024)
LLM-CoOpt: A Co-Design and Optimization Framework for Efficient LLM Inference on Heterogeneous Platforms
by: Kong, Jie, et al.
Published: (2026)
CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands
by: Wang, Weiye, et al.
Published: (2026)
Cloud Native System for LLM Inference Serving
by: Xu, Minxian, et al.
Published: (2025)
Federated Inference for Heterogeneous LLM Communication and Collaboration
by: Chen, Zihan, et al.
Published: (2026)
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
by: Ma, Tenghui, et al.
Published: (2026)
Communication-Efficient Collaborative LLM Inference over LEO Satellite Networks
by: Zhang, Songge, et al.
Published: (2026)
Chameleon: Taming Dynamic Operator Sequences for Memory-Intensive LLM Training
by: Wang, Zibo, et al.
Published: (2025)
Pie: Pooling CPU Memory for LLM Inference
by: Xu, Yi, et al.
Published: (2024)
Sparsity-Aware Roofline Models for Sparse Matrix-Matrix Multiplication
by: Qian, Matthew, et al.
Published: (2026)
Power Aware Dynamic Reallocation For Inference
by: Jiang, Yiwei, et al.
Published: (2026)
Understanding the Performance and Power of LLM Inferencing on Edge Accelerators
by: Arya, Mayank, et al.
Published: (2025)
From Attention to Disaggregation: Tracing the Evolution of LLM Inference
by: Kumar, Madabattula Rajesh, et al.
Published: (2025)
Towards Resource-Efficient Serverless LLM Inference with SLINFER
by: Xu, Chuhao, et al.
Published: (2025)
Toward Sustainability-Aware LLM Inference on Edge Clusters
by: Rajashekar, Kolichala, et al.
Published: (2025)
Distributed On-Device LLM Inference With Over-the-Air Computation
by: Zhang, Kai, et al.
Published: (2025)
Efficient LLM Inference with Activation Checkpointing and Hybrid Caching
by: Lee, Sanghyeon, et al.
Published: (2025)
WANSpec: Leveraging Global Compute Capacity for LLM Inference
by: Martin, Noah, et al.
Published: (2026)
SYMPHONY: Improving Memory Management for LLM Inference Workloads
by: Agarwal, Saurabh, et al.
Published: (2024)
AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality
by: Bournias, Ilias, et al.
Published: (2024)
LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
by: Da, Wei, et al.
Published: (2026)
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
by: Lin, Mao, et al.
Published: (2026)
A Pipelined Collaborative Speculative Decoding Framework for Efficient Edge-Cloud LLM Inference
by: Zhang, Yida, et al.
Published: (2026)