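| Main Authors: | Baldassarre, Federico; Szafraniec, Marc; Terver, Basile; Khalidov, Vasil; Massa, Francisco; LeCun, Yann; Labatut, Patrick; Seitzer, Maximilian; Bojanowski, Piotr |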
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2507.19468 |
| _version_ | 1866916864093847552 |
|---|---|
| author | Baldassarre, Federico; Szafraniec, Marc; Terver, Basile; Khalidov, Vasil; Massa, Francisco; LeCun, Yann; Labatut, Patrick; Seitzer, Maximilian; Bojanowski, Piotr |
| author_facet | Baldassarre, Federico; Szafraniec, Marc; Terver, Basile; Khalidov, Vasil; Massa, Francisco; LeCun, Yann; Labatut, Patrick; Seitzer, Maximilian; Bojanowski, Piotr |
| contents | We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2507_19468 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Back to the Features: DINO as a Foundation for Video World Models |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2507.19468 |