:: Library Catalog

Bibliographic details
Main authors:	Lin, Yi-Chien, Kwon, Woosuk, Pineda, Ronald, Paravecino, Fanny Nina
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing
Online access:	https://arxiv.org/abs/2411.17651

Similar items

APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs
by: Fan, Jiakun, et al.
Published: (2025)

DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving
by: Ruan, Chaoyi, et al.
Published: (2025)

Jenga: Effective Memory Management for Serving LLM with Heterogeneity
by: Zhang, Chen, et al.
Published: (2025)

SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)

S-HPLB: Efficient LLM Attention Serving via Sparsity-Aware Head Parallelism Load Balance
by: Liu, Di, et al.
Published: (2026)

Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism
by: Srivatsa, Vikranth, et al.
Published: (2026)

gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
by: Guo, Tianyu, et al.
Published: (2025)

Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism
by: Li, Cong, et al.
Published: (2025)

CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
by: Li, Suyi, et al.
Published: (2024)

SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling
by: Jaiswal, Shashwat, et al.
Published: (2025)

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
by: Nian, Sean, et al.
Published: (2026)

FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
by: Chen, Wenyan, et al.
Published: (2026)

BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
by: He, Yiyuan, et al.
Published: (2025)

Hetis: Serving LLMs in Heterogeneous GPU Clusters with Fine-grained and Dynamic Parallelism
by: Mo, Zizhao, et al.
Published: (2025)

Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving
by: Nie, Chengyi, et al.
Published: (2024)

Stardust: A Scalable and Extensible Simulator for the 3D Continuum
by: Pusztai, Thomas, et al.
Published: (2025)

MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
by: Duan, Jiangfei, et al.
Published: (2024)

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
by: Bu, Tianci, et al.
Published: (2026)

PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
by: Bai, Xu, et al.
Published: (2026)

EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge
by: Cao, Jiahe, et al.
Published: (2026)

Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving
by: Yoshimura, Takeshi, et al.
Published: (2026)

Regulating Branch Parallelism in LLM Serving
by: Gandhi, Swapnil, et al.
Published: (2026)

FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
by: Lin, Yanying, et al.
Published: (2025)

ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
by: Ma, Tenghui, et al.
Published: (2026)

Communication-Efficient Serving for Video Diffusion Models with Latent Parallelism
by: Wu, Zhiyuan, et al.
Published: (2025)

HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
by: Peng, You, et al.
Published: (2026)

CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
by: Yuan, Yitao, et al.
Published: (2025)

eLLM: Elastic Memory Management Framework for Efficient LLM Serving
by: Xu, Jiale, et al.
Published: (2025)

TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
by: Liu, Dong, et al.
Published: (2025)

Parallel Spawning Strategies for Dynamic-Aware MPI Applications
by: Martín-Álvarez, Iker, et al.
Published: (2025)

MixServe: An Automatic Distributed Serving System for MoE Models with Hybrid Parallelism Based on Fused Communication Algorithm
by: Zhou, Bowen, et al.
Published: (2026)

FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving
by: Gao, Shouwei, et al.
Published: (2026)

Unlock the Potential of Fine-grained LLM Serving via Dynamic Module Scaling
by: Wu, Jingfeng, et al.
Published: (2025)

DualScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS
by: Basit, Omar, et al.
Published: (2026)

MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
by: Hu, Cunchen, et al.
Published: (2024)

EconoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving
by: Shen, Haiying, et al.
Published: (2024)

ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
by: Jiang, Youhe, et al.
Published: (2025)

Predictable LLM Serving on GPU Clusters
by: Darzi, Erfan, et al.
Published: (2025)

Efficient Parallel Execution of Blockchain Transactions Leveraging Conflict Specifications
by: Anjana, Parwat Singh, et al.
Published: (2025)

Boosting LLM Serving through Spatial-Temporal GPU Resource Sharing
by: Lin, Zejia, et al.
Published: (2025)