Bibliographic Details
Main Authors: Qi, Di, Yang, Tong, Wang, Beining, Zhang, Xiangyu, Zhang, Wenqiang
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2501.16617
_version_ 1866912206943158272
author Qi, Di
Yang, Tong
Wang, Beining
Zhang, Xiangyu
Zhang, Wenqiang
contents We present a novel framework for dynamic radiance field prediction from monocular video streams. Unlike previous methods that primarily focus on predicting future frames, our method goes a step further by generating explicit 3D representations of the dynamic scene. The framework builds on two core designs. First, we adopt an ego-centric unbounded triplane to explicitly represent the dynamic physical world. Second, we develop a 4D-aware transformer that aggregates features from monocular videos to update the triplane. Coupling these two designs enables us to train the proposed model on large-scale monocular videos in a self-supervised manner. Our model achieves top results in dynamic radiance field prediction on NVIDIA dynamic scenes, demonstrating its strong performance on 4D physical world modeling. In addition, our model shows superior generalizability to unseen scenarios. Notably, we find that capabilities for geometry and semantic learning emerge from our approach.
format Preprint
id arxiv_https___arxiv_org_abs_2501_16617
institution arXiv
publishDate 2025
record_format arxiv
title Predicting 3D representations for Dynamic Scenes
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2501.16617