Staff View: Any Resolution Any Geometry: From Multi-View To Multi-Patch :: Library Catalog

Bibliographic Details
Main Authors:	Cui, Wenqing; Li, Zhenyu; Lavreniuk, Mykola; Shi, Jian; Idoughi, Ramzi; Tang, Xiangjun; Wonka, Peter
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.03026

_version_	1866917309606526976
author	Cui, Wenqing; Li, Zhenyu; Lavreniuk, Mykola; Shi, Jian; Idoughi, Ramzi; Tang, Xiangjun; Wonka, Peter
author_facet	Cui, Wenqing; Li, Zhenyu; Lavreniuk, Mykola; Shi, Jian; Idoughi, Ramzi; Tang, Xiangjun; Wonka, Peter
contents	Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth-normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation, reducing AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and lowering mean angular error from 23.36 degrees to 18.51 degrees, while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_03026
institution	arXiv
publishDate	2026
record_format	arxiv
title	Any Resolution Any Geometry: From Multi-View To Multi-Patch
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2603.03026

Similar Items