Staff View: Learning Human Skill Generators at Key-Step Levels :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Yilu; Zhu, Chenhui; Wang, Shuai; Wang, Hanlin; Wang, Jing; Zhang, Zhaoxiang; Wang, Limin
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2502.08234

_version_	1866912230440697856
author	Wu, Yilu; Zhu, Chenhui; Wang, Shuai; Wang, Hanlin; Wang, Jing; Zhang, Zhaoxiang; Wang, Limin
author_facet	Wu, Yilu; Zhu, Chenhui; Wang, Shuai; Wang, Hanlin; Wang, Jing; Zhang, Zhaoxiang; Wang, Limin
contents	We are committed to learning human skill generators at key-step levels. Generating skills is challenging, but doing it successfully could greatly facilitate human skill learning and provide more experience for embodied intelligence. Although current video generation models can synthesize simple, atomic human operations, they struggle with human skills because of their complex procedures. Human skills involve multi-step, long-duration actions and complex scene transitions, so existing naive auto-regressive methods for synthesizing long videos cannot generate them. To address this, we propose a novel task, Key-step Skill Generation (KS-Gen), aimed at reducing the complexity of generating human skill videos. Given the initial state and a skill description, the task is to generate video clips of the key steps needed to complete the skill, rather than a full-length video. To support this task, we introduce a carefully curated dataset and define multiple evaluation metrics to assess performance. Considering the complexity of KS-Gen, we propose a new framework for this task. First, a multimodal large language model (MLLM) generates descriptions of the key steps using retrieval augmentation. Next, a Key-step Image Generator (KIG) addresses the discontinuity between key steps in skill videos. Finally, a video generation model uses these descriptions and key-step images to generate video clips of the key steps with high temporal consistency. We offer a detailed analysis of the results, hoping to provide more insight into human skill generation. All models and data are available at https://github.com/MCG-NJU/KS-Gen.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_08234
institution	arXiv
publishDate	2025
record_format	arxiv
title	Learning Human Skill Generators at Key-Step Levels
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2502.08234

Similar Items