Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Peng, Haosong; Feng, Wei; Li, Hao; Zhan, Yufeng; Jin, Ren; Xia, Yuanqing
Format:	Preprint
Published:	2024
Subjects:	Multimedia; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.09245

_version_	1866909326811070464
author	Peng, Haosong; Feng, Wei; Li, Hao; Zhan, Yufeng; Jin, Ren; Xia, Yuanqing
author_facet	Peng, Haosong; Feng, Wei; Li, Hao; Zhan, Yufeng; Jin, Ren; Xia, Yuanqing
contents	The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, based on traditional model architecture (e.g., CNN, RNN, etc.), employ various strategies to filter out non-region-of-interest content to minimize bandwidth and computation consumption but show inferior performance in adverse environments. Recently, visual foundation models based on transformers have shown great performance in adverse environments due to their amazing generalization capability. However, they require a large amount of computation power, which limits their applications in real-time intelligent video analytics. In this paper, we find visual foundation models like Vision Transformer (ViT) also have a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. We leverage the capability of ViT that can be accelerated through token pruning by only offloading and feeding Patches-of-Interest to the downstream models. Additionally, we design an adaptive keyframe inference switching algorithm tailored to different videos, capable of adapting to the current video content to jointly optimize accuracy and bandwidth. Through extensive experiments, our findings reveal that Arena can boost inference speeds by up to 1.58× and 1.82× on average while consuming only 47% and 31% of the bandwidth, respectively, all with high inference accuracy.
format	Preprint
id	arxiv_https___arxiv_org_abs_2404_09245
institution	arXiv
publishDate	2024
record_format	arxiv
title	Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics
topic	Multimedia; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2404.09245

Similar Items