| Main Authors: | Lin, Haitao, Yu, Hanyang, Huang, Jingshun, Zhang, He, Ling, Yonggen, Tan, Ping, Xue, Xiangyang, Fu, Yanwei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2602.19710 |
Similar Items
CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image
by: Huang, Jingshun, et al.
Published: (2025)
You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping
by: Huang, Jingshun, et al.
Published: (2025)
ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation
by: Liu, Zhenyang, et al.
Published: (2026)
SCOOP'D: Learning Mixed-Liquid-Solid Scooping via Sim2Real Generative Policy
by: Wang, Kuanning, et al.
Published: (2025)
Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models
by: Wang, Tianyu, et al.
Published: (2024)
OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer
by: Wang, Kuanning, et al.
Published: (2026)
TP-MDDN: Task-Preferenced Multi-Demand-Driven Navigation with Autonomous Decision-Making
by: Li, Shanshan, et al.
Published: (2025)
LAC-Net: Linear-Fusion Attention-Guided Convolutional Network for Accurate Robotic Grasping Under the Occlusion
by: Zhang, Jinyu, et al.
Published: (2024)
Constraint-Aware Zero-Shot Vision-Language Navigation in Continuous Environments
by: Chen, Kehan, et al.
Published: (2024)
TriVLA: A Triple-System-Based Unified Vision-Language-Action Model with Episodic World Modeling for General Robot Control
by: Liu, Zhenyang, et al.
Published: (2025)
Beyond 'Templates': Category-Agnostic Object Pose, Size, and Shape Estimation from a Single View
by: Zhang, Jinyu, et al.
Published: (2025)
SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images
by: Yu, Junqiu, et al.
Published: (2024)
Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents
by: Zhang, Zhizhen, et al.
Published: (2025)
GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning
by: Ma, Guoqing, et al.
Published: (2026)
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
by: Lin, Yihan, et al.
Published: (2026)
Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming
by: Tong, Baoshun, et al.
Published: (2026)
Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy
by: Zhang, Tianyi, et al.
Published: (2025)
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding
by: Liu, Zhenyang, et al.
Published: (2025)
VLS: Steering Pretrained Robot Policies via Vision-Language Models
by: Liu, Shuo, et al.
Published: (2026)
CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos
by: Zhang, Chubin, et al.
Published: (2026)
ViT-VS: On the Applicability of Pretrained Vision Transformer Features for Generalizable Visual Servoing
by: Scherl, Alessandro, et al.
Published: (2025)
Spatial-Temporal Aware Visuomotor Diffusion Policy Learning
by: Liu, Zhenyang, et al.
Published: (2025)
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
by: Hou, Zhi, et al.
Published: (2025)
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
by: Liang, Zhixuan, et al.
Published: (2025)
DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation
by: Su, Taiyi, et al.
Published: (2026)
LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment
by: Nie, Dujun, et al.
Published: (2026)
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
by: Feng, Yixu, et al.
Published: (2026)
Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance
by: Wang, Runze, et al.
Published: (2026)
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
by: Chen, Xinyi, et al.
Published: (2025)
Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations
by: Grover, Shresth, et al.
Published: (2025)
Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy
by: Xu, Kechun, et al.
Published: (2025)
MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training
by: Yin, Zhenhan, et al.
Published: (2025)
Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy
by: Garcia, Ricardo, et al.
Published: (2024)
Generalizable Humanoid Manipulation with 3D Diffusion Policies
by: Ze, Yanjie, et al.
Published: (2024)
Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
by: Li, Qixiu, et al.
Published: (2025)
OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
by: Huang, Huang, et al.
Published: (2025)
Self-Improving Vision-Language-Action Models with Data Generation via Residual RL
by: Xiao, Wenli, et al.
Published: (2025)
Latent Action Pretraining from Videos
by: Ye, Seonghyeon, et al.
Published: (2024)
Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
by: Luo, Hao, et al.
Published: (2025)
LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior
by: Wang, Xinkai, et al.
Published: (2026)