Bibliographic Details
Main Authors: Xu, Huilin, Ding, Jian, Xu, Jiakun, Wang, Ruixiang, Chen, Jun, Mai, Jinjie, Fu, Yanwei, Ghanem, Bernard, Xu, Feng, Elhoseiny, Mohamed
Format: Preprint
Published: 2025
Subjects: Robotics
Online Access: https://arxiv.org/abs/2507.11296
author Xu, Huilin
Ding, Jian
Xu, Jiakun
Wang, Ruixiang
Chen, Jun
Mai, Jinjie
Fu, Yanwei
Ghanem, Bernard
Xu, Feng
Elhoseiny, Mohamed
contents Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements. Video prediction has recently been studied for representation learning and control because it captures rich dynamic and behavioral information, but its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. Specifically, we propose a multi-frame latent prediction strategy that encodes future states in a compressed latent space, preserving task-relevant features. Furthermore, we introduce a unidirectional attention mechanism in which video prediction is conditioned on the action, while action prediction remains independent of video prediction. This design allows us to omit video prediction during inference, significantly enhancing efficiency. Experiments on two simulated benchmarks and a real-world setting show that our method significantly improves the success rate over the strong baseline ACT, achieving a 24.9% increase on ALOHA, an 11.1% increase on RoboTwin, and a 32.5% increase in real-world experiments. Our models and code are publicly available at https://github.com/return-sleep/Diffusion_based_imaginative_Coordination.
format Preprint
id arxiv_https___arxiv_org_abs_2507_11296
institution arXiv
publishDate 2025
record_format arxiv
title Diffusion-Based Imaginative Coordination for Bimanual Manipulation
topic Robotics
url https://arxiv.org/abs/2507.11296
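The unidirectional attention described in the abstract can be illustrated with a toy attention mask. The sketch below is not the authors' implementation: the function names, token counts, and the single-head attention with identity projections are all assumptions made for illustration. It only shows why action outputs stay the same when video tokens are omitted, provided action queries never attend to video keys.

```python
import numpy as np

def unidirectional_mask(n_action: int, n_video: int) -> np.ndarray:
    """Boolean mask; mask[q, k] is True when query q may attend to key k.

    Token order is assumed to be [action tokens, video tokens].
    """
    n = n_action + n_video
    mask = np.ones((n, n), dtype=bool)
    # Action queries (first n_action rows) must not see video keys.
    mask[:n_action, n_action:] = False
    return mask

def masked_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head self-attention with identity projections (toy example)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
n_action, n_video, dim = 4, 6, 8  # hypothetical sizes
action_tokens = rng.normal(size=(n_action, dim))
video_tokens = rng.normal(size=(n_video, dim))

# Training-style pass: both token groups are present.
full = masked_attention(
    np.concatenate([action_tokens, video_tokens]),
    unidirectional_mask(n_action, n_video),
)
# Inference-style pass: video tokens are omitted entirely.
action_only = masked_attention(action_tokens, unidirectional_mask(n_action, 0))

# Action outputs agree in both passes because they never read video keys.
same = np.allclose(full[:n_action], action_only)
print(same)
```

Under these assumptions the video rows still attend to the action tokens, so video prediction remains conditioned on the action, while the action branch can run alone at inference time.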