Bibliographic Details
Main Authors: Lang, Xiaolei, Wang, Yang, Zhou, Yukun, Ni, Chaojun, Li, Kerui, Zhu, Jiagang, Liu, Tianze, Lv, Jiajun, Zuo, Xingxing, Ye, Yun, Huang, Guan, Wang, Xiaofeng, Zhu, Zheng
Format: Preprint
Published: 2026
Subjects: Robotics, Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2604.09330
author Lang, Xiaolei
Wang, Yang
Zhou, Yukun
Ni, Chaojun
Li, Kerui
Zhu, Jiagang
Liu, Tianze
Lv, Jiajun
Zuo, Xingxing
Ye, Yun
Huang, Guan
Wang, Xiaofeng
Zhu, Zheng
contents Recent advances in robot foundation models trained on large-scale human teleoperation data have enabled robots to perform increasingly complex real-world tasks. However, scaling these systems remains difficult because collecting task-specific demonstrations is expensive and labor-intensive. Synthetic data, especially generated videos, offer a promising direction, but existing World Models (WMs) are not directly suitable for policy learning since they do not provide paired action trajectories. World-Action (WA) models partially address this by predicting actions with visual outputs, yet often lack strong video-action alignment, while two-stage pipelines that generate video first and then infer actions introduce inefficiency and error accumulation. To address these limitations, we propose VAG, a unified flow-matching-based dual-stream framework that jointly generates video and action under visual and language conditioning. By synchronizing denoising in both branches and using an adaptive 3D pooling mechanism to transfer compact global video context to the action branch, VAG improves cross-modal consistency during generation. Across both simulated and real-world settings, VAG produces aligned video-action pairs with competitive prediction quality, supports executable trajectory replay, and provides useful synthetic pretraining data that improves downstream policy generalization, indicating its potential as a practical world-action model for embodied data synthesis.
format Preprint
id arxiv_https___arxiv_org_abs_2604_09330
institution arXiv
publishDate 2026
record_format arxiv
title VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
topic Robotics
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2604.09330