:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Xuting, Alexander, Daniel, Kakarla, Siva Kesava Reddy, Arzani, Behnaz, Liu, Vincent
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access:	https://arxiv.org/abs/2512.15705

Similar Items

A Performance Analyzer for a Public Cloud's ML-Augmented VM Allocator
by: Bostandoost, Roozbeh, et al.
Published: (2025)

Towards Safer Heuristics With XPlain
by: Karimi, Pantea, et al.
Published: (2024)

Federated Learning for Collaborative Inference Systems: The Case of Early Exit Networks
by: Kaplan, Caelin, et al.
Published: (2024)

Distributed Inference on Mobile Edge and Cloud: An Early Exit based Clustering Approach
by: Bajpai, Divya Jyoti, et al.
Published: (2024)

Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
by: Dai, Yinwei, et al.
Published: (2023)

DistrEE: Distributed Early Exit of Deep Neural Network Inference on Edge Devices
by: Peng, Xian, et al.
Published: (2025)

Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
by: Chen, Yanxi, et al.
Published: (2023)

Early-Exit meets Model-Distributed Inference at Edge Networks
by: Colocrese, Marco, et al.
Published: (2024)

Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things
by: Wang, Ziheng, et al.
Published: (2024)

Recurrent Early Exits for Federated Learning with Heterogeneous Clients
by: Lee, Royson, et al.
Published: (2024)

Designing Large Foundation Models for Efficient Training and Inference: A Survey
by: Liu, Dong, et al.
Published: (2024)

Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
by: Gupta, Vima, et al.
Published: (2024)

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
by: Lee, Wonbeom, et al.
Published: (2024)

TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
by: Lin, Chien-Yu, et al.
Published: (2025)

DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism
by: Niu, Yifan, et al.
Published: (2026)

Fast Distributed Inference Serving for Large Language Models
by: Wu, Bingyang, et al.
Published: (2023)

Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
by: Rajbhandari, Samyam, et al.
Published: (2025)

Enhancing Split Computing and Early Exit Applications through Predefined Sparsity
by: Capogrosso, Luigi, et al.
Published: (2024)

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)
by: Zhang, Tianyi, et al.
Published: (2025)

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization
by: Li, Jinhao, et al.
Published: (2023)

Practical Performance Guarantees for Pipelined DNN Inference
by: Archer, Aaron, et al.
Published: (2023)

Split CNN Inference on Networked Microcontrollers
by: Lu, Junyu, et al.
Published: (2026)

FDC: Fast KV Dimensionality Compression for Efficient LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)

PecSched: Preemptive and Efficient Cluster Scheduling for LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)

Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
by: Prabhakar, Rohan Baskar, et al.
Published: (2024)

Harvest: Adaptive Photonic Switching Schedules for Collective Communication in Scale-up Domains
by: Rahman, Mahir, et al.
Published: (2026)

Pie: Pooling CPU Memory for LLM Inference
by: Xu, Yi, et al.
Published: (2024)

Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
by: Liu, Mengfan, et al.
Published: (2025)

I-SplitEE: Image classification in Split Computing DNNs with Early Exits
by: Bajpai, Divya Jyoti, et al.
Published: (2024)

Deal: Distributed End-to-End GNN Inference for All Nodes
by: Chen, Shiyang, et al.
Published: (2025)

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
by: Gond, Raja, et al.
Published: (2025)

Scalable and Cost-Efficient ML Inference: Parallel Batch Processing with Serverless Functions
by: Barrak, Amine, et al.
Published: (2025)

PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference
by: Ning, Rui, et al.
Published: (2026)

Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud
by: Ghosh, Himel
Published: (2024)

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
by: Chen, Huamin, et al.
Published: (2026)

DNN-Powered MLOps Pipeline Optimization for Large Language Models: A Framework for Automated Deployment and Resource Management
by: Krishnamoorthy, Mahesh Vaijainthymala, et al.
Published: (2025)

AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference
by: Zhao, Xuanlei, et al.
Published: (2024)

ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference
by: Meng, Han, et al.
Published: (2026)

PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices
by: Liu, Yangyijian, et al.
Published: (2025)