Library Catalog

Bibliographic Details
Main Authors:	Liu, Zhibang; Xu, Chaonong; Liu, Zhizhuo; Huang, Lekai; Wei, Jiachen; Li, Chao
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2409.07693

Similar Items

Collaborative Inference Acceleration with Non-Penetrative Tensor Partitioning
by: Liu, Zhibang, et al.
Published: (2025)

PICO: Pipeline Inference Framework for Versatile CNNs on Diverse Mobile Devices
by: Yang, Xiang, et al.
Published: (2022)

CoCoI: Distributed Coded Inference System for Straggler Mitigation
by: Liu, Xing, et al.
Published: (2025)

Learning the Optimal Path and DNN Partition for Collaborative Edge Inference
by: Huang, Yin, et al.
Published: (2024)

Where to Split? A Pareto-Front Analysis of DNN Partitioning for Edge Inference
by: Masud, Adiba, et al.
Published: (2026)

Cooperative Gradient Coding
by: Weng, Shudi, et al.
Published: (2025)

MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
by: Duan, Jiaang, et al.
Published: (2024)

Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs
by: Chen, Aodong, et al.
Published: (2023)

Accelerating End-Cloud Collaborative Inference via Near Bubble-free Pipeline Optimization
by: Gao, Luyao, et al.
Published: (2024)

Many Hands Make Light Work: Accelerating Edge Inference via Multi-Client Collaborative Caching
by: Liang, Wenyi, et al.
Published: (2024)

WindGP: Efficient Graph Partitioning on Heterogenous Machines
by: Zeng, Li, et al.
Published: (2024)

FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
by: Liu, Xing, et al.
Published: (2025)

OnePiece: A Large-Scale Distributed Inference System with RDMA for Complex AI-Generated Content (AIGC) Workflows
by: Chen, June, et al.
Published: (2026)

RAPID: Redundancy-Aware and Compatibility-Optimal Edge-Cloud Partitioned Inference for Diverse VLA Models
by: Zheng, Zihao, et al.
Published: (2026)

SparseMap: Loop Mapping for Sparse CNNs on Streaming Coarse-grained Reconfigurable Array
by: Ni, Xiaobing, et al.
Published: (2024)

MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference
by: Tang, Xinru, et al.
Published: (2025)

HarmonyBatch: Batching multi-SLO DNN Inference with Heterogeneous Serverless Functions
by: Chen, Jiabin, et al.
Published: (2024)

Partition Detection in Byzantine Networks
by: Bromberg, Yérom-David, et al.
Published: (2024)

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
by: Hu, Cunchen, et al.
Published: (2024)

Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning
by: Saba, Issa, et al.
Published: (2024)

SLO-Aware Scheduling for Large Language Model Inferences
by: Huang, Jinqi, et al.
Published: (2025)

Persistent and Partitioned MPI for Stencil Communication
by: Collom, Gerald, et al.
Published: (2025)

Incidence Constraints in Hypergraph Partitioning on GPU
by: Ronzani, Marco, et al.
Published: (2026)

Scaling LLM Inference Beyond Amdahl's Limits via Eliminating Non-Scalable Overheads
by: Zhao, Alan, et al.
Published: (2026)

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
by: Gao, Wei, et al.
Published: (2026)

Large Language Model Partitioning for Low-Latency Inference at the Edge
by: Kafetzis, Dimitrios, et al.
Published: (2025)

Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)

YUHENG-OS: A Cloud-Native Space Cluster Operating System
by: Zhang, Jin, et al.
Published: (2026)

HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference
by: Lin, Haoran, et al.
Published: (2025)

FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism
by: Chen, Huamin, et al.
Published: (2026)

A Distributed Partitioning Software and its Applications
by: Sasidharan, Aparna
Published: (2025)

Deterministic Parallel High-Quality Hypergraph Partitioning
by: Krause, Robert, et al.
Published: (2025)

PARD: Enhancing Goodput for Inference Pipeline via Proactive Request Dropping
by: Zhao, Zhixin, et al.
Published: (2026)

FedQuad: Adaptive Layer-wise LoRA Deployment and Activation Quantization for Federated Fine-Tuning
by: Li, Rukuo, et al.
Published: (2025)

inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference
by: Chen, Huamin, et al.
Published: (2026)

SparOA: Sparse and Operator-aware Hybrid Scheduling for Edge DNN Inference
by: Zhang, Ziyang, et al.
Published: (2025)

KV Cache Compression for Inference Efficiency in LLMs: A Review
by: Liu, Yanyu, et al.
Published: (2025)

Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
by: Wei, Jinhui, et al.
Published: (2025)

Automated Deep Neural Network Inference Partitioning for Distributed Embedded Systems
by: Kreß, Fabian, et al.
Published: (2024)

DAWN: Matrix Operation-Optimized Algorithm for Shortest Paths Problem on Unweighted Graphs
by: Feng, Yelai, et al.
Published: (2022)