Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Deng, Zhichao, Li, Xiangtai, Li, Xia, Tong, Yunhai, Zhao, Shen, Liu, Mengyuan
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition Artificial Intelligence Robotics
Online Access:	https://arxiv.org/abs/2404.11605
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866911844207165440
author	Deng, Zhichao Li, Xiangtai Li, Xia Tong, Yunhai Zhao, Shen Liu, Mengyuan
author_facet	Deng, Zhichao Li, Xiangtai Li, Xia Tong, Yunhai Zhao, Shen Liu, Mengyuan
contents	Understanding the real world through point cloud video is a crucial aspect of robotics and autonomous driving systems. However, prevailing methods for 4D point cloud recognition have limitations due to sensor resolution, which leads to a lack of detailed information. Recent advances have shown that Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts that can be transferred to various downstream tasks. However, effectively integrating VLM into the domain of 4D point clouds remains an unresolved problem. In this work, we propose the Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network. Our approach involves aligning the 4D encoder's representation with a VLM to learn a shared visual and text space from training on large-scale image-text pairs. By transferring the knowledge of the VLM to the 4D encoder and combining the VLM, our VG4D achieves improved recognition performance. To enhance the 4D encoder, we modernize the classic dynamic point cloud backbone and propose an improved version of PSTNet, im-PSTNet, which can efficiently model point cloud videos. Experiments demonstrate that our method achieves state-of-the-art performance for action recognition on both the NTU RGB+D 60 dataset and the NTU RGB+D 120 dataset. Code is available at \url{https://github.com/Shark0-0/VG4D}.
format	Preprint
id	arxiv_https___arxiv_org_abs_2404_11605
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	VG4D: Vision-Language Model Goes 4D Video Recognition Deng, Zhichao Li, Xiangtai Li, Xia Tong, Yunhai Zhao, Shen Liu, Mengyuan Computer Vision and Pattern Recognition Artificial Intelligence Robotics Understanding the real world through point cloud video is a crucial aspect of robotics and autonomous driving systems. However, prevailing methods for 4D point cloud recognition have limitations due to sensor resolution, which leads to a lack of detailed information. Recent advances have shown that Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts that can be transferred to various downstream tasks. However, effectively integrating VLM into the domain of 4D point clouds remains an unresolved problem. In this work, we propose the Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network. Our approach involves aligning the 4D encoder's representation with a VLM to learn a shared visual and text space from training on large-scale image-text pairs. By transferring the knowledge of the VLM to the 4D encoder and combining the VLM, our VG4D achieves improved recognition performance. To enhance the 4D encoder, we modernize the classic dynamic point cloud backbone and propose an improved version of PSTNet, im-PSTNet, which can efficiently model point cloud videos. Experiments demonstrate that our method achieves state-of-the-art performance for action recognition on both the NTU RGB+D 60 dataset and the NTU RGB+D 120 dataset. Code is available at \url{https://github.com/Shark0-0/VG4D}.
title	VG4D: Vision-Language Model Goes 4D Video Recognition
topic	Computer Vision and Pattern Recognition Artificial Intelligence Robotics
url	https://arxiv.org/abs/2404.11605

Similar Items