Staff View:

Bibliographic Details
Main Authors:	Pei, Gensheng; Chen, Tao; Jiang, Xiruo; Liu, Huafeng; Sun, Zeren; Yao, Yazhou
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2402.19082

_version_	1866929259398823936
author	Pei, Gensheng; Chen, Tao; Jiang, Xiruo; Liu, Huafeng; Sun, Zeren; Yao, Yazhou
author_facet	Pei, Gensheng; Chen, Tao; Jiang, Xiruo; Liu, Huafeng; Sun, Zeren; Yao, Yazhou
contents	Recently, the advancement of self-supervised learning techniques, like masked autoencoders (MAE), has greatly influenced visual representation learning for images and videos. Nevertheless, it is worth noting that the predominant approaches in existing masked image / video modeling rely excessively on resource-intensive vision transformers (ViTs) as the feature encoder. In this paper, we propose a new approach termed VideoMAC, which combines video masked autoencoders with resource-friendly ConvNets. Specifically, VideoMAC employs symmetric masking on randomly sampled pairs of video frames. To prevent the issue of mask pattern dissipation, we utilize ConvNets implemented with sparse convolutional operators as encoders. Simultaneously, we present a simple yet effective masked video modeling (MVM) approach, a dual encoder architecture comprising an online encoder and an exponential moving average target encoder, aimed at facilitating inter-frame reconstruction consistency in videos. Additionally, we demonstrate that VideoMAC, empowering classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the benefits of MVM, outperforms ViT-based approaches on downstream tasks, including video object segmentation (+5.2% / 6.4% $\mathcal{J}\&\mathcal{F}$), body part propagation (+6.3% / 3.1% mIoU), and human pose tracking (+10.2% / 11.1% PCK@0.1).
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_19082
institution	arXiv
publishDate	2024
record_format	arxiv
title	VideoMAC: Video Masked Autoencoders Meet ConvNets
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2402.19082
