Bibliographic Details
Main Authors: Tang, George; Agarwal, Aditya; Han, Weiqiao; Darrell, Trevor; Bai, Yutong
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2503.06469
Table of Contents:
  • We generalize lifting to semantic lifting by incorporating per-view masks that indicate relevant pixels for lifting tasks. These masks are determined by querying corresponding multiscale pixel-aligned feature maps, which are derived from scene representations such as distilled feature fields and feature point clouds. However, storing per-view feature maps rendered from distilled feature fields is impractical, and feature point clouds are expensive to store and query. To enable lightweight on-demand retrieval of pixel-aligned relevance masks, we introduce the Vector-Quantized Feature Field. We demonstrate the effectiveness of the Vector-Quantized Feature Field on complex indoor and outdoor scenes. Semantic lifting, when paired with a Vector-Quantized Feature Field, can unlock a myriad of applications in scene representation and embodied intelligence. Specifically, we showcase how our method enables text-driven localized scene editing and significantly improves the efficiency of embodied question answering.