Staff View: A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Kaidong; Zhang, Jian; Xu, Rongtao; Sun, Yu; Xue, Shuoshuo; Wen, Youpeng; Guo, Xiaoyu; Guo, Minghao; Liufu, Weijia; Zihou, Liu; Ji, Kangyi; Zhang, Yangsong; Zhu, Jiarun; Liu, Jingzhi; Li, Zihang; Chen, Ruiyi; Cao, Meng; Zhang, Jingming; Zhao, Shen; Chang, Xiaojun; Zheng, Feng; Laptev, Ivan; Liang, Xiaodan
Format:	Preprint
Published:	2026
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2604.05672

_version_	1866908964708417536
author	Zhang, Kaidong Zhang, Jian Xu, Rongtao Sun, Yu Xue, Shuoshuo Wen, Youpeng Guo, Xiaoyu Guo, Minghao Liufu, Weijia Zihou, Liu Ji, Kangyi Zhang, Yangsong Zhu, Jiarun Liu, Jingzhi Li, Zihang Chen, Ruiyi Cao, Meng Zhang, Jingming Zhao, Shen Chang, Xiaojun Zheng, Feng Laptev, Ivan Liang, Xiaodan
author_facet	Zhang, Kaidong Zhang, Jian Xu, Rongtao Sun, Yu Xue, Shuoshuo Wen, Youpeng Guo, Xiaoyu Guo, Minghao Liufu, Weijia Zihou, Liu Ji, Kangyi Zhang, Yangsong Zhu, Jiarun Liu, Jingzhi Li, Zihang Chen, Ruiyi Cao, Meng Zhang, Jingming Zhao, Shen Chang, Xiaojun Zheng, Feng Laptev, Ivan Liang, Xiaodan
contents	Vision-Language-Action (VLA) models have emerged as a powerful paradigm for open-world robot manipulation, but their practical deployment is often constrained by cost: billion-scale VLM backbones and iterative diffusion/flow-based action heads incur high latency and compute, making real-time control expensive on commodity hardware. We present A1, a fully open-source and transparent VLA framework designed for low-cost, high-throughput inference without sacrificing manipulation success. Our approach leverages pretrained VLMs that provide implicit affordance priors for action generation. We release the full training stack (training code, data/data-processing pipeline, intermediate checkpoints, and evaluation scripts) to enable end-to-end reproducibility. Beyond optimizing the VLM alone, A1 targets the full inference pipeline by introducing a budget-aware adaptive inference scheme that jointly accelerates the backbone and the action head. Specifically, we monitor action consistency across intermediate VLM layers to trigger early termination, and propose Inter-Layer Truncated Flow Matching that warm-starts denoising across layers, enabling accurate actions with substantially fewer effective denoising iterations. Across simulation benchmarks (LIBERO, VLABench) and real robots (Franka, AgiBot), A1 achieves state-of-the-art success rates while significantly reducing inference cost (e.g., up to 72% lower per-episode latency for flow-matching inference and up to 76.6% backbone computation reduction with minor performance degradation). On RoboChallenge, A1 achieves an average success rate of 29.00%, outperforming baselines including pi0 (28.33%), X-VLA (21.33%), and RDT-1B (15.00%).
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_05672
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model Zhang, Kaidong Zhang, Jian Xu, Rongtao Sun, Yu Xue, Shuoshuo Wen, Youpeng Guo, Xiaoyu Guo, Minghao Liufu, Weijia Zihou, Liu Ji, Kangyi Zhang, Yangsong Zhu, Jiarun Liu, Jingzhi Li, Zihang Chen, Ruiyi Cao, Meng Zhang, Jingming Zhao, Shen Chang, Xiaojun Zheng, Feng Laptev, Ivan Liang, Xiaodan Robotics Vision-Language-Action (VLA) models have emerged as a powerful paradigm for open-world robot manipulation, but their practical deployment is often constrained by cost: billion-scale VLM backbones and iterative diffusion/flow-based action heads incur high latency and compute, making real-time control expensive on commodity hardware. We present A1, a fully open-source and transparent VLA framework designed for low-cost, high-throughput inference without sacrificing manipulation success. Our approach leverages pretrained VLMs that provide implicit affordance priors for action generation. We release the full training stack (training code, data/data-processing pipeline, intermediate checkpoints, and evaluation scripts) to enable end-to-end reproducibility. Beyond optimizing the VLM alone, A1 targets the full inference pipeline by introducing a budget-aware adaptive inference scheme that jointly accelerates the backbone and the action head. Specifically, we monitor action consistency across intermediate VLM layers to trigger early termination, and propose Inter-Layer Truncated Flow Matching that warm-starts denoising across layers, enabling accurate actions with substantially fewer effective denoising iterations. Across simulation benchmarks (LIBERO, VLABench) and real robots (Franka, AgiBot), A1 achieves state-of-the-art success rates while significantly reducing inference cost (e.g., up to 72% lower per-episode latency for flow-matching inference and up to 76.6% backbone computation reduction with minor performance degradation). On RoboChallenge, A1 achieves an average success rate of 29.00%, outperforming baselines including pi0 (28.33%), X-VLA (21.33%), and RDT-1B (15.00%).
title	A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
topic	Robotics
url	https://arxiv.org/abs/2604.05672

Similar Items