:: Library Catalog

Bibliographic Details
Main Author:	Erdil, Ege
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2506.04645

Similar Items

Data movement limits to frontier model training
by: Erdil, Ege, et al.
Published: (2024)

Going Forward-Forward in Distributed Deep Learning
by: Aktemur, Ege, et al.
Published: (2024)

Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)

Salted Inference: Enhancing Privacy while Maintaining Efficiency of Split Inference in Mobile Computing
by: Malekzadeh, Mohammad, et al.
Published: (2023)

Queue management for SLO-oriented large language model serving
by: Patke, Archit, et al.
Published: (2024)

Accelerate Intermittent Deep Inference
by: Zhang, Ziliang
Published: (2024)

Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
by: Rajbhandari, Samyam, et al.
Published: (2025)

Split CNN Inference on Networked Microcontrollers
by: Lu, Junyu, et al.
Published: (2026)

STAR: Decode-Phase Rescheduling for LLM Inference
by: Wang, Zhibin, et al.
Published: (2025)

Practical Performance Guarantees for Pipelined DNN Inference
by: Archer, Aaron, et al.
Published: (2023)

Pie: Pooling CPU Memory for LLM Inference
by: Xu, Yi, et al.
Published: (2024)

Optimal Scheduling Algorithms for LLM Inference: Theory and Practice
by: Bari, Agrim, et al.
Published: (2025)

Dynamic Rebatching for Efficient Early-Exit Inference with DREX
by: Liu, Xuting, et al.
Published: (2025)

Fast Distributed Inference Serving for Large Language Models
by: Wu, Bingyang, et al.
Published: (2023)

A Survey on Collaborative DNN Inference for Edge Intelligence
by: Ren, Weiqing, et al.
Published: (2022)

Priority-Aware Model-Distributed Inference at Edge Networks
by: Li, Teng, et al.
Published: (2024)

CascadeServe: Unlocking Model Cascades for Inference Serving
by: Kossmann, Ferdi, et al.
Published: (2024)

Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge
by: Koch, Fernando, et al.
Published: (2025)

Understanding and Improving Communication Performance in Multi-node LLM Inference
by: Singhania, Prajwal, et al.
Published: (2025)

Deal: Distributed End-to-End GNN Inference for All Nodes
by: Chen, Shiyang, et al.
Published: (2025)

Where Do the Joules Go? Diagnosing Inference Energy Consumption
by: Chung, Jae-Won, et al.
Published: (2026)

Making MoE-based LLM Inference Resilient with Tarragon
by: Zhang, Songyu, et al.
Published: (2026)

cuConv: A CUDA Implementation of Convolution for CNN Inference
by: Jordà, Marc, et al.
Published: (2021)

AntBatchInfer: Elastic Batch Inference in the Kubernetes Cluster
by: Li, Siyuan, et al.
Published: (2024)

ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
by: Oh, Hyungjun, et al.
Published: (2024)

Floe: Federated Specialization for Real-Time LLM-SLM Inference
by: Tian, Chunlin, et al.
Published: (2026)

Adaptive Stream Processing on Edge Devices through Active Inference
by: Sedlak, Boris, et al.
Published: (2024)

FDC: Fast KV Dimensionality Compression for Efficient LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)

PecSched: Preemptive and Efficient Cluster Scheduling for LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)

Learning the Optimal Path and DNN Partition for Collaborative Edge Inference
by: Huang, Yin, et al.
Published: (2024)

Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
by: Prabhakar, Rohan Baskar, et al.
Published: (2024)

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
by: Agrawal, Amey, et al.
Published: (2024)

Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute
by: Oviedo, Felipe, et al.
Published: (2025)

TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
by: Lin, Chien-Yu, et al.
Published: (2025)

Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference
by: Zhang, Haolin, et al.
Published: (2025)

Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
by: Liu, Mengfan, et al.
Published: (2025)

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
by: Gond, Raja, et al.
Published: (2025)

Designing Large Foundation Models for Efficient Training and Inference: A Survey
by: Liu, Dong, et al.
Published: (2024)

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
by: Fu, Yao, et al.
Published: (2024)

Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization
by: Wang, Chong, et al.
Published: (2026)