
Bibliographic Details
Main Authors:	Sun, Xinhai; Shi, Xiang; Zou, Menglin; Huang, Wenlong
Format:	Preprint
Published:	2026
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2603.20668

_version_	1866918401187774464
author	Sun, Xinhai; Shi, Xiang; Zou, Menglin; Huang, Wenlong
author_facet	Sun, Xinhai; Shi, Xiang; Zou, Menglin; Huang, Wenlong
contents	The development of embodied AI systems is increasingly constrained by the availability and structure of physical interaction data. Despite recent advances in vision-language-action (VLA) models, current pipelines suffer from high data collection cost, limited cross-embodiment alignment, and poor transfer from internet-scale visual data to robot control. We propose a region-of-interest (ROI) driven engineering workflow that introduces an egocentric, geometry-grounded data representation. By projecting end-effector poses via forward kinematics (FK) into a single external camera, we derive movement-aligned hand-centric ROIs without requiring wrist-mounted cameras or multi-view systems. Unlike directly downsampling the full frame, ROI is cropped from the original image before resizing, preserving high local information density for contact-critical regions while retaining global context. We present a reproducible pipeline covering calibration, synchronization, ROI generation, deterministic boundary handling, and metadata governance. The resulting representation is embodiment-aligned and viewpoint-normalized, enabling data reuse across heterogeneous robots. We argue that egocentric ROI serves as a practical data abstraction for scalable collection and cross-embodiment learning, bridging internet-scale perception and robot-specific control.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_20668
institution	arXiv
publishDate	2026
record_format	arxiv
title	ROI-Driven Foveated Attention for Unified Egocentric Representations in Vision-Language-Action Systems
topic	Robotics
url	https://arxiv.org/abs/2603.20668
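
The abstract in the contents field describes projecting the end-effector pose into a single external camera, cropping the ROI from the original image before resizing, and handling image boundaries deterministically. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a pinhole camera model, an end-effector position already obtained from forward kinematics, and a shift-to-fit boundary rule; the function names (`project_point`, `crop_roi`, `resize_nearest`), the intrinsics `K`, the extrinsic transform `T_cam_base`, and the ROI sizes are all hypothetical.

```python
import numpy as np

def project_point(p_base, T_cam_base, K):
    """Project a 3D point given in the robot base frame to pixel coordinates (u, v).

    Assumes a pinhole model: T_cam_base is a 4x4 base-to-camera transform,
    K is the 3x3 intrinsic matrix. Both are hypothetical calibration outputs.
    """
    p_cam = T_cam_base @ np.append(p_base, 1.0)  # homogeneous base -> camera
    if p_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    uvw = K @ p_cam[:3]
    return uvw[:2] / uvw[2]

def crop_roi(image, center_uv, size):
    """Crop a size x size window around center_uv from the full-resolution image.

    Boundary rule (an assumption): the window is shifted so it always lies
    fully inside the image, which keeps the output shape fixed.
    """
    h, w = image.shape[:2]
    if size > min(h, w):
        raise ValueError("ROI larger than image")
    half = size // 2
    x0 = int(round(center_uv[0])) - half
    y0 = int(round(center_uv[1])) - half
    x0 = min(max(x0, 0), w - size)  # clamp left edge
    y0 = min(max(y0, 0), h - size)  # clamp top edge
    return image[y0:y0 + size, x0:x0 + size], (x0, y0)

def resize_nearest(image, out_size):
    """Nearest-neighbour resize to out_size x out_size using NumPy indexing only."""
    h, w = image.shape[:2]
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return image[rows][:, cols]

# Usage with made-up calibration values.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_base = np.eye(4)                       # camera frame == base frame here
p_ee = np.array([0.1, 0.0, 1.0])             # end-effector position from FK (assumed)
frame = np.zeros((480, 640, 3), dtype=np.uint8)

uv = project_point(p_ee, T_cam_base, K)
roi, origin = crop_roi(frame, uv, 128)       # crop first, at full resolution
roi_small = resize_nearest(roi, 64)          # then resize the crop
```

Cropping before resizing is the point of the sketch: the 128 x 128 window keeps the original pixel density around the hand, whereas downsampling the whole 640 x 480 frame first would discard that detail.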

Similar Items