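| Main Authors: | Zhang, Kaidong; Zhang, Jian; Xu, Rongtao; Sun, Yu; Xue, Shuoshuo; Wen, Youpeng; Guo, Xiaoyu; Guo, Minghao; Liufu, Weijia; Zihou, Liu; Ji, Kangyi; Zhang, Yangsong; Zhu, Jiarun; Liu, Jingzhi; Li, Zihang; Chen, Ruiyi; Cao, Meng; Zhang, Jingming; Zhao, Shen; Chang, Xiaojun; Zheng, Feng; Laptev, Ivan; Liang, Xiaodan |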
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Robotics |
| Online Access: | https://arxiv.org/abs/2604.05672 |
| _version_ | 1866908964708417536 |
|---|---|
| author | Zhang, Kaidong; Zhang, Jian; Xu, Rongtao; Sun, Yu; Xue, Shuoshuo; Wen, Youpeng; Guo, Xiaoyu; Guo, Minghao; Liufu, Weijia; Zihou, Liu; Ji, Kangyi; Zhang, Yangsong; Zhu, Jiarun; Liu, Jingzhi; Li, Zihang; Chen, Ruiyi; Cao, Meng; Zhang, Jingming; Zhao, Shen; Chang, Xiaojun; Zheng, Feng; Laptev, Ivan; Liang, Xiaodan |
| contents | Vision-Language-Action (VLA) models have emerged as a powerful paradigm for open-world robot manipulation, but their practical deployment is often constrained by cost: billion-scale VLM backbones and iterative diffusion/flow-based action heads incur high latency and compute, making real-time control expensive on commodity hardware. We present A1, a fully open-source and transparent VLA framework designed for low-cost, high-throughput inference without sacrificing manipulation success. Our approach leverages pretrained VLMs that provide implicit affordance priors for action generation. We release the full training stack (training code, data/data-processing pipeline, intermediate checkpoints, and evaluation scripts) to enable end-to-end reproducibility. Beyond optimizing the VLM alone, A1 targets the full inference pipeline by introducing a budget-aware adaptive inference scheme that jointly accelerates the backbone and the action head. Specifically, we monitor action consistency across intermediate VLM layers to trigger early termination, and propose Inter-Layer Truncated Flow Matching, which warm-starts denoising across layers and enables accurate actions with substantially fewer effective denoising iterations. Across simulation benchmarks (LIBERO, VLABench) and real robots (Franka, AgiBot), A1 achieves state-of-the-art success rates while significantly reducing inference cost (e.g., up to 72% lower per-episode latency for flow-matching inference and up to 76.6% backbone computation reduction with minor performance degradation). On RoboChallenge, A1 achieves an average success rate of 29.00%, outperforming baselines including pi0 (28.33%), X-VLA (21.33%), and RDT-1B (15.00%). |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2604_05672 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model |
| topic | Robotics |
| url | https://arxiv.org/abs/2604.05672 |
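
The abstract in this record describes two mechanisms: stopping the backbone early once actions decoded from successive intermediate layers agree, and warm-starting flow-matching denoising from the previous layer's action so that later layers need fewer steps. The sketch below is a toy illustration of that general idea only, not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`euler_flow`, `adaptive_inference`, `make_toy_layer`), the parameters `warm_t`, `warm_steps`, and `tol`, the Euclidean-distance consistency test, and the stand-in "layers", which are simple straight-line velocity fields rather than a real VLM or action head.

```python
import math
import random


def euler_flow(action, velocity_fn, t_start, n_steps):
    """Euler-integrate da/dt = velocity_fn(a, t) from t_start up to t = 1."""
    dt = (1.0 - t_start) / n_steps
    for i in range(n_steps):
        t = t_start + i * dt
        v = velocity_fn(action, t)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


def adaptive_inference(layer_velocity_fns, action_dim, full_steps=10,
                       warm_steps=2, warm_t=0.8, tol=0.05, seed=0):
    """Hypothetical early-exit loop over per-layer action heads.

    The first layer denoises from Gaussian noise with full_steps steps.
    Each later layer is warm-started from the previous layer's action at
    time warm_t (an assumed truncation point) and runs only warm_steps.
    The loop exits once two consecutive actions differ by less than tol.
    """
    rng = random.Random(seed)
    prev, total_steps = None, 0
    for depth, velocity_fn in enumerate(layer_velocity_fns, start=1):
        if prev is None:
            noise = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
            action = euler_flow(noise, velocity_fn, 0.0, full_steps)
            total_steps += full_steps
        else:
            action = euler_flow(prev, velocity_fn, warm_t, warm_steps)
            total_steps += warm_steps
            gap = math.sqrt(sum((a - p) ** 2 for a, p in zip(action, prev)))
            if gap < tol:  # actions are consistent: stop early
                return action, depth, total_steps
        prev = action
    return prev, len(layer_velocity_fns), total_steps


def make_toy_layer(target):
    """Toy stand-in for a layer: straight-line velocity toward `target`."""
    def velocity_fn(action, t):
        return [(x - a) / (1.0 - t) for x, a in zip(target, action)]
    return velocity_fn


# Eight toy layers whose predicted actions converge as depth increases.
true_action = [0.1 * i for i in range(7)]
layers = [make_toy_layer([x + 0.5 ** k for x in true_action])
          for k in range(1, 9)]
action, depth, steps = adaptive_inference(layers, action_dim=7)
print(depth, steps)
```

In this toy setup the loop exits at layer 6 of 8 after 20 Euler steps in total, compared with 80 steps if every layer ran the full 10-step schedule. The numbers reflect only the synthetic layers above and say nothing about the savings reported in the abstract.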