Staff View: From CNNs to Transformers in Multimodal Human Action Recognition: A Survey :: Library Catalog

Bibliographic Details
Main Authors:	Shaikh, Muhammad Bilal; Islam, Syed Mohammed Shamsul; Chai, Douglas; Akhtar, Naveed
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition A.1; I.2.10
Online Access:	https://arxiv.org/abs/2405.15813

_version_	1866911886822342656
author	Shaikh, Muhammad Bilal; Islam, Syed Mohammed Shamsul; Chai, Douglas; Akhtar, Naveed
author_facet	Shaikh, Muhammad Bilal; Islam, Syed Mohammed Shamsul; Chai, Douglas; Akhtar, Naveed
contents	Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it using multimodal data leads to superior performance as compared to relying on a single data modality. During the adoption of deep learning for visual modelling in the last decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to the induction of multimodal computational models is the process of "fusing" the features of the individual data modalities. Hence, we specifically focus on the fusion design aspects of the MHAR approaches. We analyze the classic and emerging techniques in this regard, while also highlighting the popular trends in the adaption of CNN and Transformer building blocks for the overall problem. In particular, we emphasize recent design choices that have led to more efficient MHAR models. Unlike existing reviews, which discuss Human Action Recognition from a broad perspective, this survey is specifically aimed at pushing the boundaries of MHAR research by identifying promising architectural and fusion design choices to train practicable models. We also provide an outlook of the multimodal datasets from their scale and evaluation viewpoint. Finally, building on the reviewed literature, we discuss the challenges and future avenues for MHAR.
format	Preprint
id	arxiv_https___arxiv_org_abs_2405_15813
institution	arXiv
publishDate	2024
record_format	arxiv
title	From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
topic	Computer Vision and Pattern Recognition A.1; I.2.10
url	https://arxiv.org/abs/2405.15813

Similar Items