Staff View: Back to the Features: DINO as a Foundation for Video World Models :: Library Catalog

Bibliographic Details
Main Authors:	Baldassarre, Federico; Szafraniec, Marc; Terver, Basile; Khalidov, Vasil; Massa, Francisco; LeCun, Yann; Labatut, Patrick; Seitzer, Maximilian; Bojanowski, Piotr
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.19468

_version_	1866916864093847552
author	Baldassarre, Federico; Szafraniec, Marc; Terver, Basile; Khalidov, Vasil; Massa, Francisco; LeCun, Yann; Labatut, Patrick; Seitzer, Maximilian; Bojanowski, Piotr
author_facet	Baldassarre, Federico; Szafraniec, Marc; Terver, Basile; Khalidov, Vasil; Massa, Francisco; LeCun, Yann; Labatut, Patrick; Seitzer, Maximilian; Bojanowski, Piotr
contents	We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_19468
institution	arXiv
publishDate	2025
record_format	arxiv
title	Back to the Features: DINO as a Foundation for Video World Models
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2507.19468

Similar Items