| Main Authors: | Zhang, Zhuofan, Zhu, Ziyu, Li, Junhao, Li, Pengxiang, Wang, Tianxu, Liu, Tengyu, Ma, Xiaojian, Chen, Yixin, Jia, Baoxiong, Huang, Siyuan, Li, Qing |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2408.04034 |
Similar Items
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes
by: Wang, Tianxu, et al.
Published: (2025)
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
by: Zhu, Ziyu, et al.
Published: (2025)
Unifying 3D Vision-Language Understanding via Promptable Queries
by: Zhu, Ziyu, et al.
Published: (2024)
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
by: Jia, Baoxiong, et al.
Published: (2024)
SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
by: Linghu, Xiongkun, et al.
Published: (2025)
LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning
by: Huang, Jiangyong, et al.
Published: (2025)
Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding
by: Wang, Yan, et al.
Published: (2025)
Multi-modal Situated Reasoning in 3D Scenes
by: Linghu, Xiongkun, et al.
Published: (2024)
MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans
by: Yu, Huangyue, et al.
Published: (2025)
Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
by: Wang, Zan, et al.
Published: (2024)
3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
by: Linghu, Xiongkun, et al.
Published: (2026)
PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI
by: Yang, Yandan, et al.
Published: (2024)
SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent
by: Yang, Yandan, et al.
Published: (2025)
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
by: Huang, Jiangyong, et al.
Published: (2025)
An Embodied Generalist Agent in 3D World
by: Huang, Jiangyong, et al.
Published: (2023)
Scaling Up Dynamic Human-Scene Interaction Modeling
by: Jiang, Nan, et al.
Published: (2024)
SlotLifter: Slot-guided Feature Lifting for Learning Object-centric Radiance Fields
by: Liu, Yu, et al.
Published: (2024)
Lifting Unlabeled Internet-level Data for 3D Scene Understanding
by: Chen, Yixin, et al.
Published: (2026)
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
by: Lu, Ruijie, et al.
Published: (2024)
Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting
by: Guo, Jun, et al.
Published: (2024)
Spatial-Temporal Multi-Scale Quantization for Flexible Motion Generation
by: Wang, Zan, et al.
Published: (2025)
GWM: Towards Scalable Gaussian World Models for Robotic Manipulation
by: Lu, Guanxing, et al.
Published: (2025)
Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation
by: Li, Yuyang, et al.
Published: (2025)
Dynamic Motion Blending for Versatile Motion Editing
by: Jiang, Nan, et al.
Published: (2025)
GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill
by: Cui, Jieming, et al.
Published: (2025)
Autonomous Character-Scene Interaction Synthesis from Text Instruction
by: Jiang, Nan, et al.
Published: (2024)
Grasp Multiple Objects with One Hand
by: Li, Yuyang, et al.
Published: (2023)
AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents
by: Cui, Jieming, et al.
Published: (2024)
3D Scene Change Modeling With Consistent Multi-View Aggregation
by: Zhou, Zirui, et al.
Published: (2025)
ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning
by: Li, Kailin, et al.
Published: (2025)
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations
by: Li, Puhao, et al.
Published: (2024)
Taccel: Scaling Up Vision-based Tactile Robotics via High-performance GPU Simulation
by: Li, Yuyang, et al.
Published: (2025)
LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations
by: Lin, Yutang, et al.
Published: (2026)
Afford-X: Generalizable and Slim Affordance Reasoning for Task-oriented Manipulation
by: Zhu, Xiaomeng, et al.
Published: (2025)
PhyRecon: Physically Plausible Neural Scene Reconstruction
by: Ni, Junfeng, et al.
Published: (2024)
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
by: Gao, Zhi, et al.
Published: (2024)
ArtGS: Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting
by: Liu, Yu, et al.
Published: (2025)
Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning
by: Li, Pengxiang, et al.
Published: (2025)
ARFlow: Human Action-Reaction Flow Matching with Physical Guidance
by: Jiang, Wentao, et al.
Published: (2025)
DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding
by: Ge, Luzhou, et al.
Published: (2026)