Staff View: Pixel Motion Diffusion is What We Need for Robot Control :: Library Catalog

Bibliographic Details
Main Authors:	Nguyen, E-Ro; Zhang, Yichi; Ranasinghe, Kanchana; Li, Xiang; Ryoo, Michael S.
Format:	Preprint
Published:	2025
Subjects:	Robotics; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.22652

_version_	1866911560569454592
author	Nguyen, E-Ro; Zhang, Yichi; Ranasinghe, Kanchana; Li, Xiang; Ryoo, Michael S.
author_facet	Nguyen, E-Ro; Zhang, Yichi; Ranasinghe, Kanchana; Li, Xiang; Ryoo, Michael S.
contents	We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://eronguyen.github.io/DAWN/
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_22652
institution	arXiv
publishDate	2025
record_format	arxiv
title	Pixel Motion Diffusion is What We Need for Robot Control
topic	Robotics; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2509.22652