:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Hao; Wei, Xiaobao; He, Jingyang; Bai, Chengyu; Fan, Chun-Kai; Cao, Jiajun; Chen, Jintao; Li, Ying; Rong, Shanyu; Lu, Ming; Ju, Xiaozhu; Tang, Jian; Zhang, Shanghang
Format:	Preprint
Published:	2026
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2605.10485

Similar Items

UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying
by: Bai, Chengyu, et al.
Published: (2025)

SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics
by: Liu, Mengzhen, et al.
Published: (2026)

TC-IDM: Grounding Video Generation for Executable Zero-shot Robot Motion
by: Mi, Weishi, et al.
Published: (2026)

EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler
by: Wang, Hao, et al.
Published: (2025)

ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance
by: Li, Ying, et al.
Published: (2025)

MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
by: Cao, Jiajun, et al.
Published: (2025)

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
by: Zhou, Enshen, et al.
Published: (2025)

Action-Sketcher: From Reasoning to Action via Visual Sketches for Long-Horizon Robotic Manipulation
by: Tan, Huajie, et al.
Published: (2026)

I-MedSAM: Implicit Medical Image Segmentation with Segment Anything
by: Wei, Xiaobao, et al.
Published: (2023)

FastInit: Fast Noise Initialization for Temporally Consistent Video Generation
by: Bai, Chengyu, et al.
Published: (2025)

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
by: Zhou, Enshen, et al.
Published: (2025)

TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
by: Han, Yi, et al.
Published: (2025)

EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models
by: Bai, Yu, et al.
Published: (2026)

Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
by: Chen, Jintao, et al.
Published: (2026)

Grounding Emotion Recognition with Visual Prototypes: VEGA -- Revisiting CLIP in MERC
by: Hu, Guanyu, et al.
Published: (2025)

RoboArmGS: High-Quality Robotic Arm Splatting via Bézier Curve Refinement
by: Wang, Hao, et al.
Published: (2025)

StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
by: Chen, Yilong, et al.
Published: (2025)

Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models
by: Bachu, Saketh, et al.
Published: (2024)

EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks
by: Zhang, Yi, et al.
Published: (2025)

EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models
by: Shan, Haozhe, et al.
Published: (2026)

From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
by: Zhang, Zhengshen, et al.
Published: (2025)

ConceptWeaver: Weaving Disentangled Concepts with Flow
by: Chen, Jintao, et al.
Published: (2026)

WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation
by: Qian, Zezhong, et al.
Published: (2025)

ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models
by: Sun, Guoheng, et al.
Published: (2026)

VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models
by: Cheng, Jintao, et al.
Published: (2026)

Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment
by: Wulff, Theodor, et al.
Published: (2026)

MC-LLaVA: Multi-Concept Personalized Vision-Language Model
by: An, Ruichuan, et al.
Published: (2025)

Equilibrium in Style: A Modeling Framework on the Cash Flow and the Life Cycle of a Consumer Store
by: Han, Shanyu, et al.
Published: (2024)

Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners
by: Feng, Chun, et al.
Published: (2024)

Reshaping Action Error Distributions for Reliable Vision-Language-Action Models
by: Bai, Shuanghao, et al.
Published: (2026)

Beyond the Vision Encoder: Identifying and Mitigating Spatial Bias in Large Vision-Language Models
by: Zhu, Yingjie, et al.
Published: (2025)

SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection
by: Bai, Letian, et al.
Published: (2026)

MC-LLaVA: Multi-Concept Personalized Vision-Language Model
by: An, Ruichuan, et al.
Published: (2024)

Efficient Training of Generalizable Visuomotor Policies via Control-Aware Augmentation
by: Zhao, Yinuo, et al.
Published: (2024)

NTO3D: Neural Target Object 3D Reconstruction with Segment Anything
by: Wei, Xiaobao, et al.
Published: (2023)

OmniIndoor3D: Comprehensive Indoor 3D Reconstruction
by: Wei, Xiaobao, et al.
Published: (2025)

Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
by: Xie, Jiangnan, et al.
Published: (2025)

SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning
by: Pan, Xu, et al.
Published: (2026)

Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding
by: Xu, Yue, et al.
Published: (2024)

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing
by: Shi, Haoyuan, et al.
Published: (2026)