Staff View: Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches :: Library Catalog


Bibliographic Details
Main Authors:	Yu, Qing; Tanaka, Mikihiro; Fujiwara, Kent
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2405.04771

_version_	1866916239092219904
author	Yu, Qing; Tanaka, Mikihiro; Fujiwara, Kent
author_facet	Yu, Qing; Tanaka, Mikihiro; Fujiwara, Kent
contents	To build a cross-modal latent space between 3D human motion and language, acquiring large-scale and high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To counter this, we introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures, and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis, presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance in the benchmarks of text-to-motion retrieval, and other novel challenging tasks, such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, which are currently impeded by the lack of data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2405_04771
institution	arXiv
publishDate	2024
record_format	arxiv
title	Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2405.04771

Similar Items