Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Ashutosh, Kumar; Wang, XuDong; Yin, Xi; Grauman, Kristen; Polyak, Adam; Misra, Ishan; Girdhar, Rohit
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.14037

_version_	1866915742213996544
author	Ashutosh, Kumar; Wang, XuDong; Yin, Xi; Grauman, Kristen; Polyak, Adam; Misra, Ishan; Girdhar, Rohit
author_facet	Ashutosh, Kumar; Wang, XuDong; Yin, Xi; Grauman, Kristen; Polyak, Adam; Misra, Ishan; Girdhar, Rohit
contents	Video generation models have recently achieved impressive visual fidelity and temporal coherence. Yet they continue to struggle with complex, non-rigid motions, especially when synthesizing humans performing dynamic actions such as sports and dance. Generated videos often exhibit missing or extra limbs, distorted poses, or physically implausible actions. In this work, we propose a remarkably simple reward model, HuDA, to quantify and improve the human motion in generated videos. HuDA integrates human detection confidence for appearance quality and a temporal prompt alignment score to capture motion realism. We show that this simple reward function, which leverages off-the-shelf models without any additional training, outperforms specialized models fine-tuned with manually annotated data. Using HuDA for Group Relative Policy Optimization (GRPO) post-training of video models, we significantly enhance video generation, especially when generating complex human motions, outperforming state-of-the-art models like Wan 2.1 with a win rate of 73%. Finally, we demonstrate that HuDA improves generation quality beyond just humans, for instance, significantly improving generation of animal videos and human-object interactions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_14037
institution	arXiv
publishDate	2026
record_format	arxiv
title	Human detectors are surprisingly powerful reward models
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2601.14037
