Saved in:
Bibliographic Details
Main Authors: Wang, Letian, Kim, Seung Wook, Yang, Jiawei, Yu, Cunjun, Ivanovic, Boris, Waslander, Steven L., Wang, Yue, Fidler, Sanja, Pavone, Marco, Karkus, Peter
Format: Preprint
Published: 2024
Subjects:
Online Access:https://arxiv.org/abs/2406.12095
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866909372989308928
author Wang, Letian
Kim, Seung Wook
Yang, Jiawei
Yu, Cunjun
Ivanovic, Boris
Waslander, Steven L.
Wang, Yue
Fidler, Sanja
Pavone, Marco
Karkus, Peter
author_facet Wang, Letian
Kim, Seung Wook
Yang, Jiawei
Yu, Cunjun
Ivanovic, Boris
Waslander, Steven L.
Wang, Yue
Fidler, Sanja
Pavone, Marco
Karkus, Peter
contents We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in outdoor autonomous driving scenes. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs with limited view overlap, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets from them, which helps our model to learn enhanced 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes and Waymo NOTR datasets demonstrate that DistillNeRF significantly outperforms existing comparable state-of-the-art self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation; and it allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.
format Preprint
id arxiv_https___arxiv_org_abs_2406_12095
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features
Wang, Letian
Kim, Seung Wook
Yang, Jiawei
Yu, Cunjun
Ivanovic, Boris
Waslander, Steven L.
Wang, Yue
Fidler, Sanja
Pavone, Marco
Karkus, Peter
Computer Vision and Pattern Recognition
Artificial Intelligence
Robotics
We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in outdoor autonomous driving scenes. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs with limited view overlap, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets from them, which helps our model to learn enhanced 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes and Waymo NOTR datasets demonstrate that DistillNeRF significantly outperforms existing comparable state-of-the-art self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation; and it allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.
title DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features
topic Computer Vision and Pattern Recognition
Artificial Intelligence
Robotics
url https://arxiv.org/abs/2406.12095