Staff View: TDViT: Temporal Dilated Video Transformer for Dense Video Tasks :: Library Catalog

Bibliographic Details
Main Authors:	Sun, Guanxiong; Hua, Yang; Hu, Guosheng; Robertson, Neil
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2402.09257

_version_	1866917590261039104
author	Sun, Guanxiong; Hua, Yang; Hu, Guosheng; Robertson, Neil
author_facet	Sun, Guanxiong; Hua, Yang; Hu, Guosheng; Robertson, Neil
contents	Deep video models, for example, 3D CNNs or video transformers, have achieved promising performance on sparse video tasks, i.e., predicting one result per video. However, challenges arise when adapting existing deep video models to dense video tasks, i.e., predicting one result per frame. Specifically, these models are expensive to deploy, less effective when handling redundant frames, and struggle to capture long-range temporal correlations. To overcome these issues, we propose a Temporal Dilated Video Transformer (TDViT) that consists of carefully designed temporal dilated transformer blocks (TDTB). TDTB can efficiently extract spatiotemporal representations and effectively alleviate the negative effect of temporal redundancy. Furthermore, by using hierarchical TDTBs, our approach obtains an exponentially expanded temporal receptive field and therefore can model long-range dynamics. Extensive experiments are conducted on two different dense video benchmarks, i.e., ImageNet VID for video object detection and YouTube VIS for video instance segmentation. Excellent experimental results demonstrate the superior efficiency, effectiveness, and compatibility of our method. The code is available at https://github.com/guanxiongsun/vfe.pytorch.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_09257
institution	arXiv
publishDate	2024
record_format	arxiv
title	TDViT: Temporal Dilated Video Transformer for Dense Video Tasks
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2402.09257

Similar Items