:: Library Catalog

Bibliographic Details
Main Authors:	Zhu, Ziyu; Zhang, Zhuofan; Ma, Xiaojian; Niu, Xuesong; Chen, Yixin; Jia, Baoxiong; Deng, Zhidong; Huang, Siyuan; Li, Qing
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2405.11442

Similar Items

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
by: Zhu, Ziyu, et al.
Published: (2025)

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
by: Jia, Baoxiong, et al.
Published: (2024)

Multi-modal Situated Reasoning in 3D Scenes
by: Linghu, Xiongkun, et al.
Published: (2024)

Task-oriented Sequential Grounding and Navigation in 3D Scenes
by: Zhang, Zhuofan, et al.
Published: (2024)

LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning
by: Huang, Jiangyong, et al.
Published: (2025)

Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding
by: Wang, Yan, et al.
Published: (2025)

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
by: Huang, Jiangyong, et al.
Published: (2025)

SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
by: Linghu, Xiongkun, et al.
Published: (2025)

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
by: Linghu, Xiongkun, et al.
Published: (2026)

SlotLifter: Slot-guided Feature Lifting for Learning Object-centric Radiance Fields
by: Liu, Yu, et al.
Published: (2024)

An Embodied Generalist Agent in 3D World
by: Huang, Jiangyong, et al.
Published: (2023)

From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes
by: Wang, Tianxu, et al.
Published: (2025)

PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI
by: Yang, Yandan, et al.
Published: (2024)

SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent
by: Yang, Yandan, et al.
Published: (2025)

Lifting Unlabeled Internet-level Data for 3D Scene Understanding
by: Chen, Yixin, et al.
Published: (2026)

Spatial-Temporal Multi-Scale Quantization for Flexible Motion Generation
by: Wang, Zan, et al.
Published: (2025)

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
by: Wang, Zan, et al.
Published: (2024)

LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations
by: Lin, Yutang, et al.
Published: (2026)

MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans
by: Yu, Huangyue, et al.
Published: (2025)

GWM: Towards Scalable Gaussian World Models for Robotic Manipulation
by: Lu, Guanxing, et al.
Published: (2025)

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
by: Yang, Jie, et al.
Published: (2024)

ArtGS: Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting
by: Liu, Yu, et al.
Published: (2025)

MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
by: Lu, Ruijie, et al.
Published: (2024)

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting
by: Guo, Jun, et al.
Published: (2024)

SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
by: Guo, Ziyu, et al.
Published: (2024)

SparseOccVLA: Bridging Occupancy and Vision-Language Models via Sparse Queries for Unified 4D Scene Understanding and Planning
by: Dang, Chenxu, et al.
Published: (2026)

ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models
by: Zhao, Yanpeng, et al.
Published: (2026)

Text Promptable Surgical Instrument Segmentation with Vision-Language Models
by: Zhou, Zijian, et al.
Published: (2023)

ARFlow: Human Action-Reaction Flow Matching with Physical Guidance
by: Jiang, Wentao, et al.
Published: (2025)

Unifying 2D and 3D Vision-Language Understanding
by: Jain, Ayush, et al.
Published: (2025)

PhysPart: Physically Plausible Part Completion for Interactable Objects
by: Luo, Rundong, et al.
Published: (2024)

WildDet3D: Scaling Promptable 3D Detection in the Wild
by: Huang, Weikai, et al.
Published: (2026)

SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting
by: Huang, Yiming, et al.
Published: (2025)

Vision-Language Models Provide Promptable Representations for Reinforcement Learning
by: Chen, William, et al.
Published: (2024)

3D Vision and Language Pretraining with Large-Scale Synthetic Data
by: Yang, Dejie, et al.
Published: (2024)

nnInteractive: Redefining 3D Promptable Segmentation
by: Isensee, Fabian, et al.
Published: (2025)

Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models
by: Zhu, Hongyi, et al.
Published: (2024)

LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models
by: Wang, Jingyi, et al.
Published: (2024)

Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing
by: Shen, Hongyu, et al.
Published: (2025)

VideoArtGS: Building Digital Twins of Articulated Objects from Monocular Video
by: Liu, Yu, et al.
Published: (2025)