Staff View: Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories :: Library Catalog

Bibliographic Details
Main Authors:	Appelle, Aaron; Lynch, Jerome P.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2510.20182

_version_	1866909864927690752
author	Appelle, Aaron; Lynch, Jerome P.
author_facet	Appelle, Aaron; Lynch, Jerome P.
contents	Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. Existing benchmarks focus on individual subjects rather than scenes with multiple interacting people. However, the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye view trajectories from pixel-space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_20182
institution	arXiv
publishDate	2025
record_format	arxiv
title	Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2510.20182

Similar Items