Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Di, Duan, Weicheng, Gu, Dasen, Lu, Hongye, Zhang, Hai, Yu, Hang, Zhao, Junqiao, Chen, Guang
Format:	Preprint
Published:	2026
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2601.22988

_version_	1866911411246989312
author	Zhang, Di; Duan, Weicheng; Gu, Dasen; Lu, Hongye; Zhang, Hai; Yu, Hang; Zhao, Junqiao; Chen, Guang
author_facet	Zhang, Di; Duan, Weicheng; Gu, Dasen; Lu, Hongye; Zhang, Hai; Yu, Hang; Zhao, Junqiao; Chen, Guang
contents	Real-world robotic manipulation demands visuomotor policies capable of robust spatial scene understanding and strong generalization across diverse camera viewpoints. While recent advances in 3D-aware visual representations have shown promise, they still suffer from several key limitations: reliance on multi-view observations during inference, which is impractical in scenarios restricted to a single view; incomplete scene modeling that fails to capture the holistic and fine-grained geometric structures essential for precise manipulation; and a lack of effective policy training strategies to retain and exploit the acquired 3D knowledge. To address these challenges, we present MethodName, a unified representation-policy learning framework for view-generalizable robotic manipulation. MethodName introduces a single-view 3D pretraining paradigm that leverages point cloud reconstruction and feed-forward Gaussian splatting under multi-view supervision to learn holistic geometric representations. During policy learning, MethodName performs multi-step distillation to preserve the pretrained geometric understanding and effectively transfer it to manipulation skills. We conduct experiments on 12 RLBench tasks, where our approach outperforms the previous state-of-the-art method by 12.7% in average success rate. Further evaluation on six representative tasks demonstrates strong zero-shot view generalization, with success rate drops of only 22.0% and 29.7% under moderate and large viewpoint shifts, respectively, whereas the state-of-the-art method suffers larger decreases of 41.6% and 51.5%.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_22988
institution	arXiv
publishDate	2026
record_format	arxiv
title	Learning Geometrically-Grounded 3D Visual Representations for View-Generalizable Robotic Manipulation
topic	Robotics
url	https://arxiv.org/abs/2601.22988
