Bibliographic Details
Main Authors: Song, Weixi, Chen, Zhetao, Xu, Tao, Zeng, Xianchao, Zhou, Xinyu, Yang, Lixin, Wang, Donglin, Lu, Cewu, Li, Yong-Lu
Format: Preprint
Published: 2025
Subjects: Robotics
Online Access: https://arxiv.org/abs/2511.17898
contents Denoising-based models, such as diffusion and flow matching, have been a critical component of robotic manipulation for their strong distribution-fitting and scaling capacity. Concurrently, several works have demonstrated that simple learning objectives, such as L1 regression, can achieve performance comparable to denoising-based methods on certain tasks, while offering faster convergence and inference. In this paper, we focus on how to combine the advantages of these two paradigms: retaining the ability of denoising models to capture multi-modal distributions and avoid mode collapse while achieving the efficiency of the L1 regression objective. To achieve this vision, we reformulate the original v-prediction flow matching and transform it into sample-prediction with the L1 training objective. We empirically show that the multi-modality can be expressed via a single ODE step. Thus, we propose L1 Flow, a two-step sampling schedule that generates a suboptimal action sequence via a single integration step and then reconstructs the precise action sequence through a single prediction. The proposed method largely retains the advantages of flow matching while reducing the iterative neural function evaluations to merely two and mitigating the potential performance degradation associated with direct sample regression. We evaluate our method with varying baselines and benchmarks, including 8 tasks in MimicGen, 5 tasks in RoboMimic & PushT Bench, and one task in the real-world scenario. The results show the advantages of the proposed method with regard to training efficiency, inference speed, and overall performance. Project website: https://song-wx.github.io/l1flow.github.io/
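The abstract names two ingredients: a sample-prediction flow matching objective trained with an L1 loss, and a two-step sampler (one integration step, then one direct prediction). The NumPy sketch below is one possible reading of that description, not the authors' implementation. The linear noise-to-data path, the time convention (t = 0 is noise, t = 1 is data), the intermediate time `t_mid = 0.5`, and all function names are assumptions made for illustration; `predict` stands in for a hypothetical trained network that maps a noisy action sequence and a time to a clean action sequence.

```python
import numpy as np


def interpolate(noise, action, t):
    """Assumed linear path: x_t = (1 - t) * noise + t * action."""
    return (1.0 - t) * noise + t * action


def l1_sample_loss(predict, noise, action, t):
    """L1 loss between the predicted clean sample and the true action."""
    x_t = interpolate(noise, action, t)
    return float(np.mean(np.abs(predict(x_t, t) - action)))


def velocity_from_sample(x_t, x1_hat, t):
    """Velocity implied by a sample prediction on the linear path.

    On the assumed path, v = action - noise = (action - x_t) / (1 - t),
    so a sample prediction can be converted back into a velocity for t < 1.
    """
    return (x1_hat - x_t) / (1.0 - t)


def two_step_sample(predict, noise, t_mid=0.5):
    """Two network evaluations: one Euler step, then one direct prediction."""
    # Step 1: a single integration step from t = 0 to t = t_mid.
    v0 = velocity_from_sample(noise, predict(noise, 0.0), 0.0)
    x_mid = noise + t_mid * v0
    # Step 2: predict the clean action sequence directly from x_mid.
    return predict(x_mid, t_mid)


# Toy usage with a hypothetical predictor that always returns one fixed
# action sequence (4 timesteps, 2 action dimensions).
rng = np.random.default_rng(0)
demo_noise = rng.standard_normal((4, 2))
demo_target = np.linspace(-1.0, 1.0, 8).reshape(4, 2)
demo_action = two_step_sample(lambda x_t, t: demo_target.copy(), demo_noise)
```

In this reading, the first call moves the noise sample toward a mode of the action distribution, and the second call refines it without further integration, which matches the abstract's count of two neural function evaluations.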
format Preprint
id arxiv_https___arxiv_org_abs_2511_17898
institution arXiv
publishDate 2025
record_format arxiv
title L1 Sample Flow for Efficient Visuomotor Learning
topic Robotics
url https://arxiv.org/abs/2511.17898