| Main Authors: | Zhen, Haoyu, Li, Xiaolong, Zhao, Yilin, Zhang, Han, Liu, Sifei, Mo, Kaichun, Gan, Chuang, Radhakrishnan, Subhashree |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.22279 |
Similar Items
3D Aware Region Prompted Vision Language Model
by: Cheng, An-Chieh, et al.
Published: (2025)
LITA: Language Instructed Temporal-Localization Assistant
by: Huang, De-An, et al.
Published: (2024)
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
by: Cheng, An-Chieh, et al.
Published: (2024)
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
by: Yang, Chiao-An, et al.
Published: (2025)
M3: 3D-Spatial MultiModal Memory
by: Zou, Xueyan, et al.
Published: (2025)
RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos
by: Xia, Hongchi, et al.
Published: (2024)
InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior
by: Lin, Chenguo, et al.
Published: (2024)
LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model
by: Yang, Yixuan, et al.
Published: (2024)
Grounded 3D-Aware Spatial Vision-Language Modeling
by: Cheng, An-Chieh, et al.
Published: (2026)
Fast Spatial Memory with Elastic Test-Time Training
by: Ma, Ziqiao, et al.
Published: (2026)
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
by: Heo, Miran, et al.
Published: (2025)
3D-VLA: A 3D Vision-Language-Action Generative World Model
by: Zhen, Haoyu, et al.
Published: (2024)
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
by: Huang, De-An, et al.
Published: (2025)
InstructPix2NeRF: Instructed 3D Portrait Editing from a Single Image
by: Li, Jianhui, et al.
Published: (2023)
HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data
by: Zhang, Mengqi, et al.
Published: (2024)
COLMAP-Free 3D Gaussian Splatting
by: Fu, Yang, et al.
Published: (2023)
TIGER: Text-Instructed 3D Gaussian Retrieval and Coherent Editing
by: Xu, Teng, et al.
Published: (2024)
InstructGIE: Towards Generalizable Image Editing
by: Meng, Zichong, et al.
Published: (2024)
3D-MVP: 3D Multiview Pretraining for Robotic Manipulation
by: Qian, Shengyi, et al.
Published: (2024)
3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning
by: Yang, Yuncong, et al.
Published: (2024)
InstructHumans: Editing Animated 3D Human Textures with Instructions
by: Zhu, Jiayin, et al.
Published: (2024)
Sentinel: Embodied Cooperative Spatial Reasoning and Planning
by: Lin, Xiangye, et al.
Published: (2026)
LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation
by: Shi, Hengyu, et al.
Published: (2025)
TesserAct: Learning 4D Embodied World Models
by: Zhen, Haoyu, et al.
Published: (2025)
SIGHT: Synthesizing Image-Text Conditioned and Geometry-Guided 3D Hand-Object Trajectories
by: Gavryushin, Alexey, et al.
Published: (2025)
DogLayout: Denoising Diffusion GAN for Discrete and Continuous Layout Generation
by: Gan, Zhaoxing, et al.
Published: (2024)
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
by: Jiang, Haiyan, et al.
Published: (2026)
InstructX: Towards Unified Visual Editing with MLLM Guidance
by: Mou, Chong, et al.
Published: (2025)
ReLayout: Versatile and Structure-Preserving Design Layout Editing via Relation-Aware Design Reconstruction
by: Lin, Jiawei, et al.
Published: (2026)
Parallel Sequence Modeling via Generalized Spatial Propagation Network
by: Wang, Hongjun, et al.
Published: (2025)
Action Images: End-to-End Policy Learning via Multiview Video Generation
by: Zhen, Haoyu, et al.
Published: (2026)
Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning
by: Ran, Xingjian, et al.
Published: (2025)
PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design
by: Wei, Jiazhe, et al.
Published: (2025)
Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model
by: Peng, Jihua, et al.
Published: (2025)
Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models
by: Jang, Jaeyun, et al.
Published: (2026)
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
by: Ma, Wufei, et al.
Published: (2024)
Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning
by: Huang, Yibin, et al.
Published: (2025)
Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations
by: Yuan, Zhihao, et al.
Published: (2025)
InstructBrush: Learning Attention-based Instruction Optimization for Image Editing
by: Zhao, Ruoyu, et al.
Published: (2024)
InstructVEdit: A Holistic Approach for Instructional Video Editing
by: Zhang, Chi, et al.
Published: (2025)