Staff View: T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Mingdian; Liu, Yilin; Krishnan, Gurunandan; Bayer, Karl S; Zhou, Bing
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2409.13251

_version_	1866912036735156224
author	Liu, Mingdian; Liu, Yilin; Krishnan, Gurunandan; Bayer, Karl S; Zhou, Bing
author_facet	Liu, Mingdian; Liu, Yilin; Krishnan, Gurunandan; Bayer, Karl S; Zhou, Bing
contents	The generation of humanoid animation from text prompts can profoundly impact animation production and AR/VR experiences. However, existing methods generate only body motion data, excluding facial expressions and hand movements. This limitation, primarily due to the lack of a comprehensive whole-body motion dataset, inhibits their readiness for production use. Recent attempts to create such a dataset have resulted in either motion inconsistency among different body parts in the artificially augmented data or lower quality in the data extracted from RGB videos. In this work, we propose T2M-X, a two-stage method that learns expressive text-to-motion generation from partially annotated data. T2M-X trains three separate Vector Quantized Variational AutoEncoders (VQ-VAEs) for body, hand, and face on respective high-quality data sources to ensure high-quality motion outputs, and a Multi-indexing Generative Pretrained Transformer (GPT) model with a motion consistency loss for motion generation and coordination among different body parts. Our results show significant improvements over the baselines both quantitatively and qualitatively, demonstrating the method's robustness to dataset limitations.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_13251
institution	arXiv
publishDate	2024
record_format	arxiv
title	T2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2409.13251

Similar Items