Bibliographic Details
Main Authors: Wang, Gongshu, Wang, Zhirui, Yang, Kan
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2511.08036
_version_ 1866912701659217920
author Wang, Gongshu
Wang, Zhirui
Yang, Kan
author_facet Wang, Gongshu
Wang, Zhirui
Yang, Kan
contents Monocular depth estimation (MDE) is widely applicable but remains highly challenging due to the inherently ill-posed nature of reconstructing 3D scenes from single 2D images. Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world understanding capabilities that benefit various vision tasks. Recent studies have demonstrated significant improvements in MDE through fine-tuning these VFMs. Inspired by these developments, we propose WEDepth, a novel approach that adapts VFMs for MDE without modifying their structures or pretrained weights, while effectively eliciting and leveraging their inherent priors. Our method employs the VFM as a multi-level feature enhancer, systematically injecting prior knowledge at different representation levels. Experiments on the NYU-Depth v2 and KITTI datasets show that WEDepth establishes new state-of-the-art (SOTA) performance, achieving competitive results compared to both diffusion-based approaches (which require multiple forward passes) and methods pre-trained on relative depth. Furthermore, we demonstrate that our method exhibits strong zero-shot transfer capability across diverse scenarios.
format Preprint
id arxiv_https___arxiv_org_abs_2511_08036
institution arXiv
publishDate 2025
record_format arxiv
title WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2511.08036