Staff View: Learning Efficient Fuse-and-Refine for Feed-Forward 3D Gaussian Splatting :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Yiming; Chai, Lucy; Luo, Xuan; Niemeyer, Michael; Lagunas, Manuel; Lombardi, Stephen; Tang, Siyu; Sun, Tiancheng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.14698

_version_	1866914169568100352
author	Wang, Yiming; Chai, Lucy; Luo, Xuan; Niemeyer, Michael; Lagunas, Manuel; Lombardi, Stephen; Tang, Siyu; Sun, Tiancheng
author_facet	Wang, Yiming; Chai, Lucy; Luo, Xuan; Niemeyer, Michael; Lagunas, Manuel; Lombardi, Stephen; Tang, Siyu; Sun, Tiancheng
contents	Recent advances in feed-forward 3D Gaussian Splatting have led to rapid improvements in efficient scene reconstruction from sparse views. However, most existing approaches construct Gaussian primitives directly aligned with the pixels in one or more of the input images. This leads to redundancies in the representation when input views overlap and constrains the position of the primitives to lie along the input rays without full flexibility in 3D space. Moreover, these pixel-aligned approaches do not naturally generalize to dynamic scenes, where effectively leveraging temporal information requires resolving both redundant and newly appearing content across frames. To address these limitations, we introduce a novel Fuse-and-Refine module that enhances existing feed-forward models by merging and refining the primitives in a canonical 3D space. At the core of our method is an efficient hybrid Splat-Voxel representation: from an initial set of pixel-aligned Gaussian primitives, we aggregate local features into a coarse-to-fine voxel hierarchy, and then use a sparse voxel transformer to process these voxel features and generate refined Gaussian primitives. By fusing and refining an arbitrary number of inputs into a consistent set of primitives, our representation effectively reduces redundancy and naturally adapts to temporal frames, enabling history-aware online reconstruction of dynamic scenes. Our approach achieves state-of-the-art performance in both static and streaming scene reconstructions while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_14698
institution	arXiv
publishDate	2025
record_format	arxiv
title	Learning Efficient Fuse-and-Refine for Feed-Forward 3D Gaussian Splatting
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.14698

Similar Items