| Main Authors: | Liu, Xuting, Alexander, Daniel, Kakarla, Siva Kesava Reddy, Arzani, Behnaz, Liu, Vincent |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2512.15705 |
Similar Items
A Performance Analyzer for a Public Cloud's ML-Augmented VM Allocator
by: Bostandoost, Roozbeh, et al.
Published: (2025)
Towards Safer Heuristics With XPlain
by: Karimi, Pantea, et al.
Published: (2024)
Federated Learning for Collaborative Inference Systems: The Case of Early Exit Networks
by: Kaplan, Caelin, et al.
Published: (2024)
Distributed Inference on Mobile Edge and Cloud: An Early Exit based Clustering Approach
by: Bajpai, Divya Jyoti, et al.
Published: (2024)
Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
by: Dai, Yinwei, et al.
Published: (2023)
DistrEE: Distributed Early Exit of Deep Neural Network Inference on Edge Devices
by: Peng, Xian, et al.
Published: (2025)
Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
by: Chen, Yanxi, et al.
Published: (2023)
Early-Exit meets Model-Distributed Inference at Edge Networks
by: Colocrese, Marco, et al.
Published: (2024)
Adaptive Resolution Inference (ARI): Energy-Efficient Machine Learning for Internet of Things
by: Wang, Ziheng, et al.
Published: (2024)
Recurrent Early Exits for Federated Learning with Heterogeneous Clients
by: Lee, Royson, et al.
Published: (2024)
Designing Large Foundation Models for Efficient Training and Inference: A Survey
by: Liu, Dong, et al.
Published: (2024)
Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
by: Gupta, Vima, et al.
Published: (2024)
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
by: Lee, Wonbeom, et al.
Published: (2024)
TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
by: Lin, Chien-Yu, et al.
Published: (2025)
DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism
by: Niu, Yifan, et al.
Published: (2026)
Fast Distributed Inference Serving for Large Language Models
by: Wu, Bingyang, et al.
Published: (2023)
Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
by: Rajbhandari, Samyam, et al.
Published: (2025)
Enhancing Split Computing and Early Exit Applications through Predefined Sparsity
by: Capogrosso, Luigi, et al.
Published: (2024)
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)
by: Zhang, Tianyi, et al.
Published: (2025)
Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization
by: Li, Jinhao, et al.
Published: (2023)
Practical Performance Guarantees for Pipelined DNN Inference
by: Archer, Aaron, et al.
Published: (2023)
Split CNN Inference on Networked Microcontrollers
by: Lu, Junyu, et al.
Published: (2026)
FDC: Fast KV Dimensionality Compression for Efficient LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)
PecSched: Preemptive and Efficient Cluster Scheduling for LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)
Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
by: Prabhakar, Rohan Baskar, et al.
Published: (2024)
Harvest: Adaptive Photonic Switching Schedules for Collective Communication in Scale-up Domains
by: Rahman, Mahir, et al.
Published: (2026)
Pie: Pooling CPU Memory for LLM Inference
by: Xu, Yi, et al.
Published: (2024)
Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
by: Liu, Mengfan, et al.
Published: (2025)
I-SplitEE: Image classification in Split Computing DNNs with Early Exits
by: Bajpai, Divya Jyoti, et al.
Published: (2024)
Deal: Distributed End-to-End GNN Inference for All Nodes
by: Chen, Shiyang, et al.
Published: (2025)
TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
by: Gond, Raja, et al.
Published: (2025)
Scalable and Cost-Efficient ML Inference: Parallel Batch Processing with Serverless Functions
by: Barrak, Amine, et al.
Published: (2025)
PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference
by: Ning, Rui, et al.
Published: (2026)
Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud
by: Ghosh, Himel
Published: (2024)
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
by: Chen, Huamin, et al.
Published: (2026)
DNN-Powered MLOps Pipeline Optimization for Large Language Models: A Framework for Automated Deployment and Resource Management
by: Krishnamoorthy, Mahesh Vaijainthymala, et al.
Published: (2025)
AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference
by: Zhao, Xuanlei, et al.
Published: (2024)
ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference
by: Meng, Han, et al.
Published: (2026)
PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices
by: Liu, Yangyijian, et al.
Published: (2025)