Bibliographic Details
Main Authors: Sun, Guanxiong; Hua, Yang; Hu, Guosheng; Robertson, Neil
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2402.09257
author Sun, Guanxiong
Hua, Yang
Hu, Guosheng
Robertson, Neil
contents Deep video models, such as 3D CNNs and video transformers, have achieved promising performance on sparse video tasks, i.e., predicting one result per video. However, challenges arise when adapting existing deep video models to dense video tasks, i.e., predicting one result per frame. Specifically, these models are expensive to deploy, less effective at handling redundant frames, and struggle to capture long-range temporal correlations. To overcome these issues, we propose a Temporal Dilated Video Transformer (TDViT) built from carefully designed temporal dilated transformer blocks (TDTBs). A TDTB efficiently extracts spatiotemporal representations and alleviates the negative effect of temporal redundancy. Furthermore, by stacking TDTBs hierarchically, our approach obtains an exponentially expanded temporal receptive field and can therefore model long-range dynamics. Extensive experiments are conducted on two dense video benchmarks: ImageNet VID for video object detection and YouTube VIS for video instance segmentation. The experimental results demonstrate the efficiency, effectiveness, and compatibility of our method. The code is available at https://github.com/guanxiongsun/vfe.pytorch.
format Preprint
id arxiv_https___arxiv_org_abs_2402_09257
institution arXiv
publishDate 2024
record_format arxiv
title TDViT: Temporal Dilated Video Transformer for Dense Video Tasks
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2402.09257
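A note on the abstract's receptive-field claim: stacking temporally dilated blocks can widen the temporal receptive field exponentially with depth. The Python sketch below illustrates the arithmetic only. It assumes, hypothetically, that each block attends to the current frame plus one earlier frame and that the dilation doubles at each stage; the function names and this schedule are illustrative assumptions, not taken from the TDViT code.

```python
from typing import List


def temporal_receptive_field(dilations: List[int], frames_per_block: int = 2) -> int:
    """Count the frames reachable after stacking blocks with the given dilations.

    Assumption: every block looks at `frames_per_block` frames spaced
    `dilation` frames apart, so each block extends the field by
    (frames_per_block - 1) * dilation frames.
    """
    field = 1  # a single frame sees only itself
    for dilation in dilations:
        field += (frames_per_block - 1) * dilation
    return field


def sampled_frames(t: int, dilation: int, frames_per_block: int = 2) -> List[int]:
    """Frame indices one block at time step t attends to (clamped at frame 0)."""
    return [max(t - i * dilation, 0) for i in range(frames_per_block)]


# Hypothetical schedule: the dilation doubles at each of four stages.
doubling = [2 ** stage for stage in range(4)]  # [1, 2, 4, 8]
constant = [1, 1, 1, 1]                        # no dilation, for comparison

field_doubling = temporal_receptive_field(doubling)
field_constant = temporal_receptive_field(constant)
```

Under these assumptions the doubling schedule reaches 2^n frames after n stages, whereas a constant dilation of 1 grows only linearly (n + 1 frames).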