Bibliographic Details
Main Authors: Chappa, Naga VS Raviteja, Nguyen, Pha, Nelson, Alexander H, Seo, Han-Seok, Li, Xin, Dobbs, Page Daniel, Luu, Khoa
Format: Preprint
Published: 2023
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2305.06310
_version_ 1866910703637495808
author Chappa, Naga VS Raviteja
Nguyen, Pha
Nelson, Alexander H
Seo, Han-Seok
Li, Xin
Dobbs, Page Daniel
Luu, Khoa
contents This paper introduces a novel approach to Social Group Activity Recognition (SoGAR) using a self-supervised Transformer network that can effectively utilize unlabeled video data. To extract spatio-temporal information, we create local and global views with varying frame rates. Our self-supervised objective ensures that features extracted from contrasting views of the same video are consistent across spatio-temporal domains. Our approach makes efficient use of transformer-based encoders to alleviate the limitations of the weakly supervised setting of group activity recognition. By leveraging the benefits of transformer models, our approach can model long-term relationships along spatio-temporal dimensions. Our proposed SoGAR method achieves state-of-the-art results on three group activity recognition benchmarks, namely the JRDB-PAR, NBA, and Volleyball datasets, surpassing the previous best results in terms of F1-score, MCA, and MPCA metrics.
format Preprint
id arxiv_https___arxiv_org_abs_2305_06310
institution arXiv
publishDate 2023
record_format arxiv
title SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2305.06310