Staff View: What Are You Doing? A Closer Look at Controllable Human Video Generation :: Library Catalog

Bibliographic Details
Main Authors:	Bugliarello, Emanuele; Arnab, Anurag; Paiss, Roni; Kindermans, Pieter-Jan; Schmid, Cordelia
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.04666

_version_	1866910861633781760
author	Bugliarello, Emanuele; Arnab, Anurag; Paiss, Roni; Kindermans, Pieter-Jan; Schmid, Cordelia
author_facet	Bugliarello, Emanuele; Arnab, Anurag; Paiss, Roni; Kindermans, Pieter-Jan; Schmid, Cordelia
contents	High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human generation. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok and TED-Talks, lack the diversity and complexity to fully capture the capabilities of video generation models. We close this gap by introducing 'What Are You Doing?' (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1,544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories. These allow us to systematically measure performance across 9 aspects of human generation, including actions, interactions and motion. We also propose and validate automatic metrics that leverage our annotations and better capture human evaluations. Equipped with our dataset and metrics, we perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation, showing how WYD provides novel insights about the capabilities of these models. We release our data and code to drive forward progress in human video generation modeling at https://github.com/google-deepmind/wyd-benchmark.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_04666
institution	arXiv
publishDate	2025
record_format	arxiv
title	What Are You Doing? A Closer Look at Controllable Human Video Generation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.04666

Similar Items