Bibliographic Details
Main Authors: Baldassarre, Federico; Szafraniec, Marc; Terver, Basile; Khalidov, Vasil; Massa, Francisco; LeCun, Yann; Labatut, Patrick; Seitzer, Maximilian; Bojanowski, Piotr
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2507.19468
author Baldassarre, Federico
Szafraniec, Marc
Terver, Basile
Khalidov, Vasil
Massa, Francisco
LeCun, Yann
Labatut, Patrick
Seitzer, Maximilian
Bojanowski, Piotr
contents We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g., segmentation and depth forecasting, and demonstrates a strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
format Preprint
id arxiv_https___arxiv_org_abs_2507_19468
institution arXiv
publishDate 2025
record_format arxiv
title Back to the Features: DINO as a Foundation for Video World Models
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2507.19468