| Main Authors: | Zhu, Ziyu; Zhang, Zhuofan; Ma, Xiaojian; Niu, Xuesong; Chen, Yixin; Jia, Baoxiong; Deng, Zhidong; Huang, Siyuan; Li, Qing |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2405.11442 |
Similar Items
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
by: Zhu, Ziyu, et al.
Published: (2025)
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
by: Jia, Baoxiong, et al.
Published: (2024)
Multi-modal Situated Reasoning in 3D Scenes
by: Linghu, Xiongkun, et al.
Published: (2024)
Task-oriented Sequential Grounding and Navigation in 3D Scenes
by: Zhang, Zhuofan, et al.
Published: (2024)
LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning
by: Huang, Jiangyong, et al.
Published: (2025)
Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding
by: Wang, Yan, et al.
Published: (2025)
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
by: Huang, Jiangyong, et al.
Published: (2025)
SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
by: Linghu, Xiongkun, et al.
Published: (2025)
3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
by: Linghu, Xiongkun, et al.
Published: (2026)
SlotLifter: Slot-guided Feature Lifting for Learning Object-centric Radiance Fields
by: Liu, Yu, et al.
Published: (2024)
An Embodied Generalist Agent in 3D World
by: Huang, Jiangyong, et al.
Published: (2023)
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes
by: Wang, Tianxu, et al.
Published: (2025)
PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI
by: Yang, Yandan, et al.
Published: (2024)
SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent
by: Yang, Yandan, et al.
Published: (2025)
Lifting Unlabeled Internet-level Data for 3D Scene Understanding
by: Chen, Yixin, et al.
Published: (2026)
Spatial-Temporal Multi-Scale Quantization for Flexible Motion Generation
by: Wang, Zan, et al.
Published: (2025)
Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
by: Wang, Zan, et al.
Published: (2024)
LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations
by: Lin, Yutang, et al.
Published: (2026)
MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans
by: Yu, Huangyue, et al.
Published: (2025)
GWM: Towards Scalable Gaussian World Models for Robotic Manipulation
by: Lu, Guanxing, et al.
Published: (2025)
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
by: Yang, Jie, et al.
Published: (2024)
ArtGS: Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting
by: Liu, Yu, et al.
Published: (2025)
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
by: Lu, Ruijie, et al.
Published: (2024)
Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting
by: Guo, Jun, et al.
Published: (2024)
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
by: Guo, Ziyu, et al.
Published: (2024)
SparseOccVLA: Bridging Occupancy and Vision-Language Models via Sparse Queries for Unified 4D Scene Understanding and Planning
by: Dang, Chenxu, et al.
Published: (2026)
ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models
by: Zhao, Yanpeng, et al.
Published: (2026)
Text Promptable Surgical Instrument Segmentation with Vision-Language Models
by: Zhou, Zijian, et al.
Published: (2023)
ARFlow: Human Action-Reaction Flow Matching with Physical Guidance
by: Jiang, Wentao, et al.
Published: (2025)
Unifying 2D and 3D Vision-Language Understanding
by: Jain, Ayush, et al.
Published: (2025)
PhysPart: Physically Plausible Part Completion for Interactable Objects
by: Luo, Rundong, et al.
Published: (2024)
WildDet3D: Scaling Promptable 3D Detection in the Wild
by: Huang, Weikai, et al.
Published: (2026)
SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting
by: Huang, Yiming, et al.
Published: (2025)
Vision-Language Models Provide Promptable Representations for Reinforcement Learning
by: Chen, William, et al.
Published: (2024)
3D Vision and Language Pretraining with Large-Scale Synthetic Data
by: Yang, Dejie, et al.
Published: (2024)
nnInteractive: Redefining 3D Promptable Segmentation
by: Isensee, Fabian, et al.
Published: (2025)
Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models
by: Zhu, Hongyi, et al.
Published: (2024)
LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models
by: Wang, Jingyi, et al.
Published: (2024)
Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing
by: Shen, Hongyu, et al.
Published: (2025)
VideoArtGS: Building Digital Twins of Articulated Objects from Monocular Video
by: Liu, Yu, et al.
Published: (2025)