Staff View: Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D :: Library Catalog

Bibliographic Details
Main Authors:	T, Mukund Varma; Wang, Peihao; Fan, Zhiwen; Wang, Zhangyang; Su, Hao; Ramamoorthi, Ravi
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2403.18922

_version_	1866916182049685504
author	T, Mukund Varma; Wang, Peihao; Fan, Zhiwen; Wang, Zhangyang; Su, Hao; Ramamoorthi, Ravi
author_facet	T, Mukund Varma; Wang, Peihao; Fan, Zhiwen; Wang, Zhangyang; Su, Hao; Ramamoorthi, Ravi
contents	In recent years, there has been an explosion of 2D vision models for numerous tasks such as semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets. At the same time, there has been renewed interest in 3D scene representations such as neural radiance fields from multi-view images. However, the availability of 3D or multiview data is still substantially limited compared to 2D image datasets, making extending 2D vision models to 3D data highly desirable but also very challenging. Indeed, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task and often requires per-scene optimization. In this paper, we ask the question of whether any 2D vision model can be lifted to make 3D consistent predictions. We answer this question in the affirmative; our new Lift3D method trains to predict unseen views on feature spaces generated by a few visual models (i.e. DINO and CLIP), but then generalizes to novel vision operators and tasks, such as style transfer, super-resolution, open vocabulary segmentation and image colorization; for some of these tasks, there is no comparable previous 3D method. In many cases, we even outperform state-of-the-art methods specialized for the task in question. Moreover, Lift3D is a zero-shot method, in the sense that it requires no task-specific training, nor scene-specific optimization.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_18922
institution	arXiv
publishDate	2024
record_format	arxiv
title	Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2403.18922
