Bibliographic Details
Main Authors: Tan, Shuo; Liu, Rui; Han, Xuesong; Long, XianLei; Wan, Kai; Song, Linqi; Li, Yong
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2411.01579
author Tan, Shuo
Liu, Rui
Han, Xuesong
Long, XianLei
Wan, Kai
Song, Linqi
Li, Yong
contents Deploying Convolutional Neural Networks (CNNs) on resource-constrained devices necessitates efficient management of computational resources, often via distributed environments susceptible to latency from straggler nodes. This paper introduces the Flexible Coded Distributed Convolution Computing (FCDCC) framework to enhance straggler resilience and numerical stability in distributed CNNs. We extend Coded Distributed Computing (CDC) with Circulant and Rotation Matrix Embedding (CRME), which was originally proposed for matrix multiplication, to high-dimensional tensor convolution. For the proposed scheme, referred to as the Numerically Stable Coded Tensor Convolution (NSCTC) scheme, we also propose two new coded partitioning schemes: Adaptive-Padding Coded Partitioning (APCP) for the input tensor and Kernel-Channel Coded Partitioning (KCCP) for the filter tensor. These strategies enable the linear decomposition of tensor convolutions and their encoding into CDC subtasks, combining model parallelism with coded redundancy for robust and efficient execution. Theoretical analysis identifies an optimal trade-off between communication and storage costs. Empirical results validate the framework's effectiveness in computational efficiency, straggler resilience, and scalability across various CNN architectures.
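The abstract's central idea, that convolution is linear in the filter tensor and can therefore be split into coded subtasks that tolerate stragglers, can be illustrated with a small sketch. This is not the paper's CRME/NSCTC construction: it assumes a toy real-valued (3, 2) generator matrix `G`, a naive `conv2d` helper, and filter partitioning along the output-channel axis only, all of which are hypothetical choices made for illustration.

```python
import itertools
import numpy as np


def conv2d(x, w):
    """Valid cross-correlation. x: (C, H, W); w: (N, C, kh, kw) -> (N, H', W')."""
    n_out, _, kh, kw = w.shape
    h_out, w_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    y = np.zeros((n_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + kh, j:j + kw]
            # Contract channel and kernel axes of every filter with the patch.
            y[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return y


# Hypothetical (n=3, k=2) generator matrix: any two rows form an invertible
# 2x2 matrix, so any two of the three worker results suffice to decode.
G = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])


def encode_filters(w, G):
    """Split w into k blocks along output channels and form n coded filters."""
    k = G.shape[1]
    blocks = np.split(w, k, axis=0)
    return [sum(g * b for g, b in zip(row, blocks)) for row in G]


def decode(results, worker_ids, G):
    """Recover the full output from the results of any k workers."""
    g_sub = G[list(worker_ids)]                  # (k, k) rows of the survivors
    stacked = np.stack(results)                  # (k, N/k, H', W')
    blocks = np.tensordot(np.linalg.inv(g_sub), stacked, axes=([1], [0]))
    return np.concatenate(list(blocks), axis=0)  # (N, H', W')


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))       # input tensor: 3 channels, 8x8
w = rng.standard_normal((4, 3, 3, 3))    # 4 filters of size 3x3x3
reference = conv2d(x, w)                 # uncoded result for comparison

coded = encode_filters(w, G)
worker_out = [conv2d(x, cw) for cw in coded]   # one coded subtask per worker

# Any single worker may straggle; every pair of survivors decodes correctly.
for survivors in itertools.combinations(range(3), 2):
    recovered = decode([worker_out[s] for s in survivors], survivors, G)
    assert np.allclose(recovered, reference)
```

Decoding here inverts a small real matrix, which is well conditioned for this toy code but can become numerically fragile as the number of workers grows; that concern is what motivates the numerically stable embedding the abstract describes.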
format Preprint
id arxiv_https___arxiv_org_abs_2411_01579
institution arXiv
publishDate 2024
record_format arxiv
title Flexible Coded Distributed Convolution Computing for Enhanced Straggler Resilience and Numerical Stability in Distributed CNNs
topic Distributed, Parallel, and Cluster Computing
Artificial Intelligence
Computer Vision and Pattern Recognition
Information Theory
Machine Learning
url https://arxiv.org/abs/2411.01579