Saved in:
| Main Authors: | , , |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.08036 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866912701659217920 |
|---|---|
| author | Wang, Gongshu Wang, Zhirui Yang, Kan |
| author_facet | Wang, Gongshu Wang, Zhirui Yang, Kan |
| contents | Monocular depth estimation (MDE) has widely applicable but remains highly challenging due to the inherently ill-posed nature of reconstructing 3D scenes from single 2D images. Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world understanding capabilities that benefit for various vision tasks. Recent studies have demonstrated significant improvements in MDE through fine-tuning these VFMs. Inspired by these developments, we propose WEDepth, a novel approach that adapts VFMs for MDE without modi-fying their structures and pretrained weights, while effec-tively eliciting and leveraging their inherent priors. Our method employs the VFM as a multi-level feature en-hancer, systematically injecting prior knowledge at differ-ent representation levels. Experiments on NYU-Depth v2 and KITTI datasets show that WEDepth establishes new state-of-the-art (SOTA) performance, achieving competi-tive results compared to both diffusion-based approaches (which require multiple forward passes) and methods pre-trained on relative depth. Furthermore, we demonstrate our method exhibits strong zero-shot transfer capability across diverse scenarios. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2511_08036 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| spellingShingle | WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation Wang, Gongshu Wang, Zhirui Yang, Kan Computer Vision and Pattern Recognition Monocular depth estimation (MDE) has widely applicable but remains highly challenging due to the inherently ill-posed nature of reconstructing 3D scenes from single 2D images. Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world understanding capabilities that benefit for various vision tasks. Recent studies have demonstrated significant improvements in MDE through fine-tuning these VFMs. Inspired by these developments, we propose WEDepth, a novel approach that adapts VFMs for MDE without modi-fying their structures and pretrained weights, while effec-tively eliciting and leveraging their inherent priors. Our method employs the VFM as a multi-level feature en-hancer, systematically injecting prior knowledge at differ-ent representation levels. Experiments on NYU-Depth v2 and KITTI datasets show that WEDepth establishes new state-of-the-art (SOTA) performance, achieving competi-tive results compared to both diffusion-based approaches (which require multiple forward passes) and methods pre-trained on relative depth. Furthermore, we demonstrate our method exhibits strong zero-shot transfer capability across diverse scenarios. |
| title | WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2511.08036 |