:: Library Catalog

Bibliographic Details
Main Authors:	Bari, Agrim; Hegde, Parikshit; de Veciana, Gustavo
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2508.01002

Similar Items

ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
by: Oh, Hyungjun, et al.
Published: (2024)

PecSched: Preemptive and Efficient Cluster Scheduling for LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)

Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference
by: Tian, Jian, et al.
Published: (2025)

Locality-aware Fair Scheduling in LLM Serving
by: Cao, Shiyi, et al.
Published: (2025)

Preble: Efficient Distributed Prompt Scheduling for LLM Serving
by: Srivatsa, Vikranth, et al.
Published: (2024)

Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference
by: Siavashi, Mohammad, et al.
Published: (2025)

Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)

Practical Performance Guarantees for Pipelined DNN Inference
by: Archer, Aaron, et al.
Published: (2023)

DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training
by: Qiang, Xinwei, et al.
Published: (2026)

Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing
by: Zhao, Lingxiao, et al.
Published: (2025)

MultiTASC++: A Continuously Adaptive Scheduler for Edge-Based Multi-Device Cascade Inference
by: Nikolaidis, Sokratis, et al.
Published: (2024)

Deadline-Aware Online Scheduling for LLM Fine-Tuning with Spot Market Predictions
by: Kong, Linggao, et al.
Published: (2025)

Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC
by: Wei, Xinming, et al.
Published: (2025)

HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
by: Zhong, Shuzhang, et al.
Published: (2025)

STAR: Decode-Phase Rescheduling for LLM Inference
by: Wang, Zhibin, et al.
Published: (2025)

Pie: Pooling CPU Memory for LLM Inference
by: Xu, Yi, et al.
Published: (2024)

AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training
by: Bai, Huawei, et al.
Published: (2025)

A Tabular Schedule Abstraction for Communication-Aware Evaluation of Pipeline-Parallel LLM Training
by: Barley, Daniel, et al.
Published: (2026)

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
by: Yu, Jiahuan, et al.
Published: (2026)

Learning the Optimal Path and DNN Partition for Collaborative Edge Inference
by: Huang, Yin, et al.
Published: (2024)

Prompt-Aware Scheduling for Efficient Text-to-Image Inferencing System
by: Agarwal, Shubham, et al.
Published: (2025)

FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
by: Bin, Kyungmin, et al.
Published: (2025)

Understanding and Improving Communication Performance in Multi-node LLM Inference
by: Singhania, Prajwal, et al.
Published: (2025)

Making MoE-based LLM Inference Resilient with Tarragon
by: Zhang, Songyu, et al.
Published: (2026)

Floe: Federated Specialization for Real-Time LLM-SLM Inference
by: Tian, Chunlin, et al.
Published: (2026)

FDC: Fast KV Dimensionality Compression for Efficient LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
by: Agrawal, Amey, et al.
Published: (2024)

Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference
by: Zhang, Haolin, et al.
Published: (2025)

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
by: Gond, Raja, et al.
Published: (2025)

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
by: Fu, Yao, et al.
Published: (2024)

Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
by: Recasens, Pol G., et al.
Published: (2025)

No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha
by: Agrawal, Amey, et al.
Published: (2024)

PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference
by: Ning, Rui, et al.
Published: (2026)

Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud
by: Ghosh, Himel
Published: (2024)

Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
by: Deshmukh, Dhruv, et al.
Published: (2025)

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
by: Chen, Huamin, et al.
Published: (2026)

HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference
by: Zhang, Zeyu, et al.
Published: (2025)

MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines
by: Gao, Lei, et al.
Published: (2024)

KVDirect: Distributed Disaggregated LLM Inference
by: Chen, Shiyang, et al.
Published: (2024)

Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation
by: Fang, Jingzhi, et al.
Published: (2025)