Staff View: WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Gongshu; Wang, Zhirui; Yang, Kan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2511.08036

_version_	1866912701659217920
author	Wang, Gongshu; Wang, Zhirui; Yang, Kan
author_facet	Wang, Gongshu; Wang, Zhirui; Yang, Kan
contents	Monocular depth estimation (MDE) is widely applicable but remains highly challenging due to the inherently ill-posed nature of reconstructing 3D scenes from single 2D images. Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world understanding capabilities that benefit various vision tasks. Recent studies have demonstrated significant improvements in MDE through fine-tuning these VFMs. Inspired by these developments, we propose WEDepth, a novel approach that adapts VFMs for MDE without modifying their structures and pretrained weights, while effectively eliciting and leveraging their inherent priors. Our method employs the VFM as a multi-level feature enhancer, systematically injecting prior knowledge at different representation levels. Experiments on the NYU-Depth v2 and KITTI datasets show that WEDepth establishes new state-of-the-art (SOTA) performance, achieving competitive results compared to both diffusion-based approaches (which require multiple forward passes) and methods pre-trained on relative depth. Furthermore, we demonstrate that our method exhibits strong zero-shot transfer capability across diverse scenarios.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_08036
institution	arXiv
publishDate	2025
record_format	arxiv
title	WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2511.08036

Similar Items