Bibliographic Details
Main Authors: Cao, Ang, Arnaud, Sergio, Maksymets, Oleksandr, Yang, Jianing, Jain, Ayush, Yenamandra, Sriram, Martin, Ada, Berges, Vincent-Pierre, McVay, Paul, Partsey, Ruslan, Rajeswaran, Aravind, Meier, Franziska, Johnson, Justin, Park, Jeong Joon, Sax, Alexander
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2502.20389
_version_ 1866916783551676416
author Cao, Ang
Arnaud, Sergio
Maksymets, Oleksandr
Yang, Jianing
Jain, Ayush
Yenamandra, Sriram
Martin, Ada
Berges, Vincent-Pierre
McVay, Paul
Partsey, Ruslan
Rajeswaran, Aravind
Meier, Franziska
Johnson, Justin
Park, Jeong Joon
Sax, Alexander
contents 3D vision-language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes--a six-order-of-magnitude gap that severely limits performance. We introduce LIFT-GS, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render-supervised formulation enables end-to-end training of complete encoder-decoder architectures and is inherently model-agnostic. LIFT-GS achieves state-of-the-art results with 25.7% mAP on open-vocabulary instance segmentation (vs. 20.2% prior SOTA) and consistent 10-30% improvements on referential grounding tasks. Remarkably, pretraining effectively multiplies fine-tuning datasets by 2x, demonstrating strong scaling properties that suggest 3D VLG currently operates in a severely data-scarce regime. Project page: https://liftgs.github.io
format Preprint
id arxiv_https___arxiv_org_abs_2502_20389
institution arXiv
publishDate 2025
record_format arxiv
title From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2502.20389