| Main Authors: | Liu, Zhibang, Xu, Chaonong, Liu, Zhizhuo, Huang, Lekai, Wei, Jiachen, Li, Chao |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2409.07693 |
Similar Items
Collaborative Inference Acceleration with Non-Penetrative Tensor Partitioning
by: Liu, Zhibang, et al.
Published: (2025)
PICO: Pipeline Inference Framework for Versatile CNNs on Diverse Mobile Devices
by: Yang, Xiang, et al.
Published: (2022)
CoCoI: Distributed Coded Inference System for Straggler Mitigation
by: Liu, Xing, et al.
Published: (2025)
Learning the Optimal Path and DNN Partition for Collaborative Edge Inference
by: Huang, Yin, et al.
Published: (2024)
Where to Split? A Pareto-Front Analysis of DNN Partitioning for Edge Inference
by: Masud, Adiba, et al.
Published: (2026)
Cooperative Gradient Coding
by: Weng, Shudi, et al.
Published: (2025)
MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
by: Duan, Jiaang, et al.
Published: (2024)
Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs
by: Chen, Aodong, et al.
Published: (2023)
Accelerating End-Cloud Collaborative Inference via Near Bubble-free Pipeline Optimization
by: Gao, Luyao, et al.
Published: (2024)
Many Hands Make Light Work: Accelerating Edge Inference via Multi-Client Collaborative Caching
by: Liang, Wenyi, et al.
Published: (2024)
WindGP: Efficient Graph Partitioning on Heterogenous Machines
by: Zeng, Li, et al.
Published: (2024)
FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
by: Liu, Xing, et al.
Published: (2025)
OnePiece: A Large-Scale Distributed Inference System with RDMA for Complex AI-Generated Content (AIGC) Workflows
by: Chen, June, et al.
Published: (2026)
RAPID: Redundancy-Aware and Compatibility-Optimal Edge-Cloud Partitioned Inference for Diverse VLA Models
by: Zheng, Zihao, et al.
Published: (2026)
SparseMap: Loop Mapping for Sparse CNNs on Streaming Coarse-grained Reconfigurable Array
by: Ni, Xiaobing, et al.
Published: (2024)
MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference
by: Tang, Xinru, et al.
Published: (2025)
HarmonyBatch: Batching multi-SLO DNN Inference with Heterogeneous Serverless Functions
by: Chen, Jiabin, et al.
Published: (2024)
Partition Detection in Byzantine Networks
by: Bromberg, Yérom-David, et al.
Published: (2024)
Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
by: Hu, Cunchen, et al.
Published: (2024)
Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning
by: Saba, Issa, et al.
Published: (2024)
SLO-Aware Scheduling for Large Language Model Inferences
by: Huang, Jinqi, et al.
Published: (2025)
Persistent and Partitioned MPI for Stencil Communication
by: Collom, Gerald, et al.
Published: (2025)
Incidence Constraints in Hypergraph Partitioning on GPU
by: Ronzani, Marco, et al.
Published: (2026)
Scaling LLM Inference Beyond Amdahl's Limits via Eliminating Non-Scalable Overheads
by: Zhao, Alan, et al.
Published: (2026)
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
by: Gao, Wei, et al.
Published: (2026)
Large Language Model Partitioning for Low-Latency Inference at the Edge
by: Kafetzis, Dimitrios, et al.
Published: (2025)
Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)
YUHENG-OS: A Cloud-Native Space Cluster Operating System
by: Zhang, Jin, et al.
Published: (2026)
HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference
by: Lin, Haoran, et al.
Published: (2025)
FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism
by: Chen, Huamin, et al.
Published: (2026)
A Distributed Partitioning Software and its Applications
by: Sasidharan, Aparna
Published: (2025)
Deterministic Parallel High-Quality Hypergraph Partitioning
by: Krause, Robert, et al.
Published: (2025)
PARD: Enhancing Goodput for Inference Pipeline via Proactive Request Dropping
by: Zhao, Zhixin, et al.
Published: (2026)
FedQuad: Adaptive Layer-wise LoRA Deployment and Activation Quantization for Federated Fine-Tuning
by: Li, Rukuo, et al.
Published: (2025)
inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference
by: Chen, Huamin, et al.
Published: (2026)
SparOA: Sparse and Operator-aware Hybrid Scheduling for Edge DNN Inference
by: Zhang, Ziyang, et al.
Published: (2025)
KV Cache Compression for Inference Efficiency in LLMs: A Review
by: Liu, Yanyu, et al.
Published: (2025)
Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
by: Wei, Jinhui, et al.
Published: (2025)
Automated Deep Neural Network Inference Partitioning for Distributed Embedded Systems
by: Kreß, Fabian, et al.
Published: (2024)
DAWN: Matrix Operation-Optimized Algorithm for Shortest Paths Problem on Unweighted Graphs
by: Feng, Yelai, et al.
Published: (2022)