Bibliographic Details
Main Authors: Zhang, Jiahao; Liu, Joseph; Lee, Young-Yoon; Moon, Seonghyeon; Zordan, Victor; Tevet, Guy; Liu, Karen; Gould, Stephen; Jacob, Oren; Jiang, Haomiao; Kapadia, Mubbasir; Ben-Shabat, Yizhak
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2605.26241
_version_ 1866916046681669632
author Zhang, Jiahao
Liu, Joseph
Lee, Young-Yoon
Moon, Seonghyeon
Zordan, Victor
Tevet, Guy
Liu, Karen
Gould, Stephen
Jacob, Oren
Jiang, Haomiao
Kapadia, Mubbasir
Ben-Shabat, Yizhak
contents Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences. We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves these tradeoffs. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure enables fine-grained, per-category evaluation that reveals model strengths and weaknesses obscured by global metrics. We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.
format Preprint
id arxiv_https___arxiv_org_abs_2605_26241
institution arXiv
publishDate 2026
record_format arxiv
title RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2605.26241