Bibliographic Details
Main Authors: Sun, Xiangyu; Jiang, Haoyi; Liu, Liu; Nam, Seungtae; Kang, Gyeongjin; Wang, Xinjie; Sui, Wei; Su, Zhizhong; Liu, Wenyu; Wang, Xinggang; Park, Eunbyung
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2508.03643
author Sun, Xiangyu
Jiang, Haoyi
Liu, Liu
Nam, Seungtae
Kang, Gyeongjin
Wang, Xinjie
Sui, Wei
Su, Zhizhong
Liu, Wenyu
Wang, Xinggang
Park, Eunbyung
contents Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction, all within a single, feed-forward pass. Extensive experiments demonstrate that Uni3R establishes a new state-of-the-art across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet. Our work signifies a novel paradigm towards generalizable, unified 3D scene reconstruction and understanding. The code is available at https://github.com/HorizonRobotics/Uni3R.
format Preprint
id arxiv_https___arxiv_org_abs_2508_03643
institution arXiv
publishDate 2025
record_format arxiv
title Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2508.03643