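| Main Authors: | Zhang, Jiahao; Liu, Joseph; Lee, Young-Yoon; Moon, Seonghyeon; Zordan, Victor; Tevet, Guy; Liu, Karen; Gould, Stephen; Jacob, Oren; Jiang, Haomiao; Kapadia, Mubbasir; Ben-Shabat, Yizhak |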
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2605.26241 |
| _version_ | 1866916046681669632 |
|---|---|
| author | Zhang, Jiahao; Liu, Joseph; Lee, Young-Yoon; Moon, Seonghyeon; Zordan, Victor; Tevet, Guy; Liu, Karen; Gould, Stephen; Jacob, Oren; Jiang, Haomiao; Kapadia, Mubbasir; Ben-Shabat, Yizhak |
| author_facet | Zhang, Jiahao; Liu, Joseph; Lee, Young-Yoon; Moon, Seonghyeon; Zordan, Victor; Tevet, Guy; Liu, Karen; Gould, Stephen; Jacob, Oren; Jiang, Haomiao; Kapadia, Mubbasir; Ben-Shabat, Yizhak |
| contents | Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences. We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves this tradeoff. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure enables fine-grained, per-category evaluation that reveals model strengths and weaknesses obscured by global metrics. We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2605_26241 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2605.26241 |