Bibliographic Details
Main Authors: Nguyen, E-Ro, Zhang, Yichi, Ranasinghe, Kanchana, Li, Xiang, Ryoo, Michael S.
Format: Preprint
Published: 2025
Subjects: Robotics; Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2509.22652
author Nguyen, E-Ro
Zhang, Yichi
Ranasinghe, Kanchana
Li, Xiang
Ryoo, Michael S.
contents We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://eronguyen.github.io/DAWN/
format Preprint
id arxiv_https___arxiv_org_abs_2509_22652
institution arXiv
publishDate 2025
record_format arxiv
title Pixel Motion Diffusion is What We Need for Robot Control
topic Robotics
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2509.22652