Bibliographic Details
Main Authors:	Li, Yanbang; Gong, Ziyang; Li, Haoyang; Huang, Xiaoqi; Kang, Haolan; Bai, Guangping; Ma, Xianzheng
Format:	Preprint
Published:	2025
Subjects:	Robotics; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.00693

_version_	1866912503619911680
author	Li, Yanbang; Gong, Ziyang; Li, Haoyang; Huang, Xiaoqi; Kang, Haolan; Bai, Guangping; Ma, Xianzheng
author_facet	Li, Yanbang; Gong, Ziyang; Li, Haoyang; Huang, Xiaoqi; Kang, Haolan; Bai, Guangping; Ma, Xianzheng
contents	Recently, natural language has been the primary medium for human-robot interaction. However, its inherent lack of spatial precision introduces challenges for robotic task definition such as ambiguity and verbosity. Moreover, in some public settings where quiet is required, such as libraries or hospitals, verbal communication with robots is inappropriate. To address these limitations, we introduce the Robotic Visual Instruction (RoVI), a novel paradigm to guide robotic tasks through an object-centric, hand-drawn symbolic representation. RoVI effectively encodes spatial-temporal information into human-interpretable visual instructions through 2D sketches, utilizing arrows, circles, colors, and numbers to direct 3D robotic manipulation. To enable robots to understand RoVI better and generate precise actions based on RoVI, we present Visual Instruction Embodied Workflow (VIEW), a pipeline formulated for RoVI-conditioned policies. This approach leverages Vision-Language Models (VLMs) to interpret RoVI inputs, decode spatial and temporal constraints from 2D pixel space via keypoint extraction, and then transform them into executable 3D action sequences. We additionally curate a specialized dataset of 15K instances to fine-tune small VLMs for edge deployment, enabling them to effectively learn RoVI capabilities. Our approach is rigorously validated across 11 novel tasks in both real and simulated environments, demonstrating significant generalization capability. Notably, VIEW achieves an 87.5% success rate in real-world scenarios involving unseen tasks that feature multi-step actions, disturbances, and trajectory-following requirements. Project website: https://robotic-visual-instruction.github.io/
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_00693
institution	arXiv
publishDate	2025
record_format	arxiv
title	Robotic Visual Instruction
topic	Robotics; Artificial Intelligence; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2505.00693

Similar Items