| Main Author: | Erdil, Ege |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.04645 |
Similar Items
Data movement limits to frontier model training
by: Erdil, Ege, et al.
Published: (2024)
Going Forward-Forward in Distributed Deep Learning
by: Aktemur, Ege, et al.
Published: (2024)
Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)
Salted Inference: Enhancing Privacy while Maintaining Efficiency of Split Inference in Mobile Computing
by: Malekzadeh, Mohammad, et al.
Published: (2023)
Queue management for SLO-oriented large language model serving
by: Patke, Archit, et al.
Published: (2024)
Accelerate Intermittent Deep Inference
by: Zhang, Ziliang
Published: (2024)
Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
by: Rajbhandari, Samyam, et al.
Published: (2025)
Split CNN Inference on Networked Microcontrollers
by: Lu, Junyu, et al.
Published: (2026)
STAR: Decode-Phase Rescheduling for LLM Inference
by: Wang, Zhibin, et al.
Published: (2025)
Practical Performance Guarantees for Pipelined DNN Inference
by: Archer, Aaron, et al.
Published: (2023)
Pie: Pooling CPU Memory for LLM Inference
by: Xu, Yi, et al.
Published: (2024)
Optimal Scheduling Algorithms for LLM Inference: Theory and Practice
by: Bari, Agrim, et al.
Published: (2025)
Dynamic Rebatching for Efficient Early-Exit Inference with DREX
by: Liu, Xuting, et al.
Published: (2025)
Fast Distributed Inference Serving for Large Language Models
by: Wu, Bingyang, et al.
Published: (2023)
A Survey on Collaborative DNN Inference for Edge Intelligence
by: Ren, Weiqing, et al.
Published: (2022)
Priority-Aware Model-Distributed Inference at Edge Networks
by: Li, Teng, et al.
Published: (2024)
CascadeServe: Unlocking Model Cascades for Inference Serving
by: Kossmann, Ferdi, et al.
Published: (2024)
Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge
by: Koch, Fernando, et al.
Published: (2025)
Understanding and Improving Communication Performance in Multi-node LLM Inference
by: Singhania, Prajwal, et al.
Published: (2025)
Deal: Distributed End-to-End GNN Inference for All Nodes
by: Chen, Shiyang, et al.
Published: (2025)
Where Do the Joules Go? Diagnosing Inference Energy Consumption
by: Chung, Jae-Won, et al.
Published: (2026)
Making MoE-based LLM Inference Resilient with Tarragon
by: Zhang, Songyu, et al.
Published: (2026)
cuConv: A CUDA Implementation of Convolution for CNN Inference
by: Jordà, Marc, et al.
Published: (2021)
AntBatchInfer: Elastic Batch Inference in the Kubernetes Cluster
by: Li, Siyuan, et al.
Published: (2024)
ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference
by: Oh, Hyungjun, et al.
Published: (2024)
Floe: Federated Specialization for Real-Time LLM-SLM Inference
by: Tian, Chunlin, et al.
Published: (2026)
Adaptive Stream Processing on Edge Devices through Active Inference
by: Sedlak, Boris, et al.
Published: (2024)
FDC: Fast KV Dimensionality Compression for Efficient LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)
PecSched: Preemptive and Efficient Cluster Scheduling for LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)
Learning the Optimal Path and DNN Partition for Collaborative Edge Inference
by: Huang, Yin, et al.
Published: (2024)
Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
by: Prabhakar, Rohan Baskar, et al.
Published: (2024)
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
by: Agrawal, Amey, et al.
Published: (2024)
Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute
by: Oviedo, Felipe, et al.
Published: (2025)
TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
by: Lin, Chien-Yu, et al.
Published: (2025)
Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference
by: Zhang, Haolin, et al.
Published: (2025)
Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
by: Liu, Mengfan, et al.
Published: (2025)
TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
by: Gond, Raja, et al.
Published: (2025)
Designing Large Foundation Models for Efficient Training and Inference: A Survey
by: Liu, Dong, et al.
Published: (2024)
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
by: Fu, Yao, et al.
Published: (2024)
Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization
by: Wang, Chong, et al.
Published: (2026)