Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Wang, Letian, Kim, Seung Wook, Yang, Jiawei, Yu, Cunjun, Ivanovic, Boris, Waslander, Steven L., Wang, Yue, Fidler, Sanja, Pavone, Marco, Karkus, Peter
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition Artificial Intelligence Robotics
Online Access:	https://arxiv.org/abs/2406.12095
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866909372989308928
author	Wang, Letian Kim, Seung Wook Yang, Jiawei Yu, Cunjun Ivanovic, Boris Waslander, Steven L. Wang, Yue Fidler, Sanja Pavone, Marco Karkus, Peter
author_facet	Wang, Letian Kim, Seung Wook Yang, Jiawei Yu, Cunjun Ivanovic, Boris Waslander, Steven L. Wang, Yue Fidler, Sanja Pavone, Marco Karkus, Peter
contents	We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in outdoor autonomous driving scenes. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs with limited view overlap, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets from them, which helps our model to learn enhanced 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes and Waymo NOTR datasets demonstrate that DistillNeRF significantly outperforms existing comparable state-of-the-art self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation; and it allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_12095
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features Wang, Letian Kim, Seung Wook Yang, Jiawei Yu, Cunjun Ivanovic, Boris Waslander, Steven L. Wang, Yue Fidler, Sanja Pavone, Marco Karkus, Peter Computer Vision and Pattern Recognition Artificial Intelligence Robotics We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in outdoor autonomous driving scenes. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs with limited view overlap, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets from them, which helps our model to learn enhanced 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes and Waymo NOTR datasets demonstrate that DistillNeRF significantly outperforms existing comparable state-of-the-art self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation; and it allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.
title	DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features
topic	Computer Vision and Pattern Recognition Artificial Intelligence Robotics
url	https://arxiv.org/abs/2406.12095

Similar Items