Bibliographic Details
Main Authors: Zhang, Kaidong, Zhang, Jian, Xu, Rongtao, Sun, Yu, Xue, Shuoshuo, Wen, Youpeng, Guo, Xiaoyu, Guo, Minghao, Liufu, Weijia, Zihou, Liu, Ji, Kangyi, Zhang, Yangsong, Zhu, Jiarun, Liu, Jingzhi, Li, Zihang, Chen, Ruiyi, Cao, Meng, Zhang, Jingming, Zhao, Shen, Chang, Xiaojun, Zheng, Feng, Laptev, Ivan, Liang, Xiaodan
Format: Preprint
Published: 2026
Subjects: Robotics
Online Access: https://arxiv.org/abs/2604.05672
_version_ 1866908964708417536
author Zhang, Kaidong
Zhang, Jian
Xu, Rongtao
Sun, Yu
Xue, Shuoshuo
Wen, Youpeng
Guo, Xiaoyu
Guo, Minghao
Liufu, Weijia
Zihou, Liu
Ji, Kangyi
Zhang, Yangsong
Zhu, Jiarun
Liu, Jingzhi
Li, Zihang
Chen, Ruiyi
Cao, Meng
Zhang, Jingming
Zhao, Shen
Chang, Xiaojun
Zheng, Feng
Laptev, Ivan
Liang, Xiaodan
contents Vision-Language-Action (VLA) models have emerged as a powerful paradigm for open-world robot manipulation, but their practical deployment is often constrained by cost: billion-scale VLM backbones and iterative diffusion/flow-based action heads incur high latency and compute, making real-time control expensive on commodity hardware. We present A1, a fully open-source and transparent VLA framework designed for low-cost, high-throughput inference without sacrificing manipulation success. Our approach leverages pretrained VLMs that provide implicit affordance priors for action generation. We release the full training stack (training code, data/data-processing pipeline, intermediate checkpoints, and evaluation scripts) to enable end-to-end reproducibility. Beyond optimizing the VLM alone, A1 targets the full inference pipeline by introducing a budget-aware adaptive inference scheme that jointly accelerates the backbone and the action head. Specifically, we monitor action consistency across intermediate VLM layers to trigger early termination, and propose Inter-Layer Truncated Flow Matching that warm-starts denoising across layers, enabling accurate actions with substantially fewer effective denoising iterations. Across simulation benchmarks (LIBERO, VLABench) and real robots (Franka, AgiBot), A1 achieves state-of-the-art success rates while significantly reducing inference cost (e.g., up to 72% lower per-episode latency for flow-matching inference and up to 76.6% backbone computation reduction with minor performance degradation). On RoboChallenge, A1 achieves an average success rate of 29.00%, outperforming baselines including pi0 (28.33%), X-VLA (21.33%), and RDT-1B (15.00%).
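The abstract names two mechanisms: early termination triggered by action consistency across intermediate layers, and a warm-started (truncated) flow-matching denoiser. The toy sketch below illustrates the general idea only and is not the authors' implementation. The function names (`early_exit_layer`, `euler_denoise`), the L2 consistency test, the tolerance, and the straight-line velocity field are all assumptions made for illustration; the record does not specify any of them.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length action vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def early_exit_layer(layer_actions, tol):
    """Hypothetical consistency test: return (layer_index, action) for the
    first layer whose action differs from the previous layer's action by
    less than `tol`; fall back to the last layer if none qualifies."""
    for i in range(1, len(layer_actions)):
        if l2_distance(layer_actions[i], layer_actions[i - 1]) < tol:
            return i, layer_actions[i]
    return len(layer_actions) - 1, layer_actions[-1]


def euler_denoise(x, velocity, t_start, num_steps):
    """Integrate dx/dt = velocity(x, t) from t_start to 1 with Euler steps.
    Starting at t_start > 0 from an earlier estimate (a warm start) leaves
    a shorter interval to integrate than starting from noise at t = 0."""
    dt = (1.0 - t_start) / num_steps
    t = t_start
    for _ in range(num_steps):
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x


if __name__ == "__main__":
    # Made-up per-layer action predictions (assumed data, not from the paper).
    actions = [[0.0, 0.0], [0.5, 0.4], [0.58, 0.49], [0.6, 0.5], [0.6, 0.5]]
    layer, action = early_exit_layer(actions, tol=0.05)
    print("exit layer:", layer, "action:", action)

    # Toy straight-line velocity field pointing at a fixed target action.
    target = [0.6, 0.5]

    def velocity(x, t):
        return [(g - xi) / max(1.0 - t, 1e-8) for xi, g in zip(x, target)]

    # Warm start from the previous layer's action at t = 0.5 with 2 steps.
    print("refined:", euler_denoise(actions[2], velocity, t_start=0.5, num_steps=2))
```

In this toy setting the warm start covers only half of the time interval, so it needs fewer Euler steps than integrating from noise at t = 0. How A1 actually chooses the exit criterion, the starting time, and the step budget is described in the paper, not in this record.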
format Preprint
id arxiv_https___arxiv_org_abs_2604_05672
institution arXiv
publishDate 2026
record_format arxiv
title A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
topic Robotics
url https://arxiv.org/abs/2604.05672