| Main Authors: | Nguyen, E-Ro; Zhang, Yichi; Ranasinghe, Kanchana; Li, Xiang; Ryoo, Michael S. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Robotics; Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2509.22652 |
| _version_ | 1866911560569454592 |
|---|---|
| author | Nguyen, E-Ro; Zhang, Yichi; Ranasinghe, Kanchana; Li, Xiang; Ryoo, Michael S. |
| author_facet | Nguyen, E-Ro; Zhang, Yichi; Ranasinghe, Kanchana; Li, Xiang; Ryoo, Michael S. |
| contents | We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://eronguyen.github.io/DAWN/ |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2509_22652 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Pixel Motion Diffusion is What We Need for Robot Control |
| topic | Robotics; Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2509.22652 |