| Main Authors: | Wang, Hao, Wei, Xiaobao, He, Jingyang, Bai, Chengyu, Fan, Chun-Kai, Cao, Jiajun, Chen, Jintao, Li, Ying, Rong, Shanyu, Lu, Ming, Ju, Xiaozhu, Tang, Jian, Zhang, Shanghang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.10485 |
Similar Items
UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying
by: Bai, Chengyu, et al.
Published: (2025)
SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics
by: Liu, Mengzhen, et al.
Published: (2026)
TC-IDM: Grounding Video Generation for Executable Zero-shot Robot Motion
by: Mi, Weishi, et al.
Published: (2026)
EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler
by: Wang, Hao, et al.
Published: (2025)
ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance
by: Li, Ying, et al.
Published: (2025)
MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
by: Cao, Jiajun, et al.
Published: (2025)
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
by: Zhou, Enshen, et al.
Published: (2025)
Action-Sketcher: From Reasoning to Action via Visual Sketches for Long-Horizon Robotic Manipulation
by: Tan, Huajie, et al.
Published: (2026)
I-MedSAM: Implicit Medical Image Segmentation with Segment Anything
by: Wei, Xiaobao, et al.
Published: (2023)
FastInit: Fast Noise Initialization for Temporally Consistent Video Generation
by: Bai, Chengyu, et al.
Published: (2025)
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
by: Zhou, Enshen, et al.
Published: (2025)
TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
by: Han, Yi, et al.
Published: (2025)
EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models
by: Bai, Yu, et al.
Published: (2026)
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
by: Chen, Jintao, et al.
Published: (2026)
Grounding Emotion Recognition with Visual Prototypes: VEGA -- Revisiting CLIP in MERC
by: Hu, Guanyu, et al.
Published: (2025)
RoboArmGS: High-Quality Robotic Arm Splatting via Bézier Curve Refinement
by: Wang, Hao, et al.
Published: (2025)
StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
by: Chen, Yilong, et al.
Published: (2025)
Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models
by: Bachu, Saketh, et al.
Published: (2024)
EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks
by: Zhang, Yi, et al.
Published: (2025)
EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models
by: Shan, Haozhe, et al.
Published: (2026)
From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
by: Zhang, Zhengshen, et al.
Published: (2025)
ConceptWeaver: Weaving Disentangled Concepts with Flow
by: Chen, Jintao, et al.
Published: (2026)
WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation
by: Qian, Zezhong, et al.
Published: (2025)
ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models
by: Sun, Guoheng, et al.
Published: (2026)
VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models
by: Cheng, Jintao, et al.
Published: (2026)
Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment
by: Wulff, Theodor, et al.
Published: (2026)
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
by: An, Ruichuan, et al.
Published: (2025)
Equilibrium in Style: A Modeling Framework on the Cash Flow and the Life Cycle of a Consumer Store
by: Han, Shanyu, et al.
Published: (2024)
Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners
by: Feng, Chun, et al.
Published: (2024)
Reshaping Action Error Distributions for Reliable Vision-Language-Action Models
by: Bai, Shuanghao, et al.
Published: (2026)
Beyond the Vision Encoder: Identifying and Mitigating Spatial Bias in Large Vision-Language Models
by: Zhu, Yingjie, et al.
Published: (2025)
SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection
by: Bai, Letian, et al.
Published: (2026)
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
by: An, Ruichuan, et al.
Published: (2024)
Efficient Training of Generalizable Visuomotor Policies via Control-Aware Augmentation
by: Zhao, Yinuo, et al.
Published: (2024)
NTO3D: Neural Target Object 3D Reconstruction with Segment Anything
by: Wei, Xiaobao, et al.
Published: (2023)
OmniIndoor3D: Comprehensive Indoor 3D Reconstruction
by: Wei, Xiaobao, et al.
Published: (2025)
Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
by: Xie, Jiangnan, et al.
Published: (2025)
SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning
by: Pan, Xu, et al.
Published: (2026)
Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding
by: Xu, Yue, et al.
Published: (2024)
VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing
by: Shi, Haoyuan, et al.
Published: (2026)