Bibliographic Details
Main Authors: Yu, Ryan; Zhang, Pushi; Liu, Starrick; Liu, Brae; Kang, Miracle; Li, Shalfun; Shi, Lights; Ma, Ellie; Yang, Ping; Pan, Chris; Chen, Jerry; Liu, Dongxiu; Sun, Rain; Guo, Miles; Zhang, Byron; Zhou, Hugo; Xu, Zach; Chen, Vincent; Huang, Harrison; Wang, James; Kuzi, Dance; Zhai, Andy; Su, Hang; Gan, Roy; Liang, Lucy; Wang, Hao; Wang, Qian
Format: Preprint
Published: 2026
Subjects: Robotics
Online Access: https://arxiv.org/abs/2605.30877
author Yu, Ryan
Zhang, Pushi
Liu, Starrick
Liu, Brae
Kang, Miracle
Li, Shalfun
Shi, Lights
Ma, Ellie
Yang, Ping
Pan, Chris
Chen, Jerry
Liu, Dongxiu
Sun, Rain
Guo, Miles
Zhang, Byron
Zhou, Hugo
Xu, Zach
Chen, Vincent
Huang, Harrison
Wang, James
Kuzi, Dance
Zhai, Andy
Su, Hang
Gan, Roy
Liang, Lucy
Wang, Hao
Wang, Qian
contents Large-scale Vision-Language-Action (VLA) pretraining is increasingly adopted as the foundation for robot policies, yet the evidence for pretrained VLAs is almost invariably reported after task-specific fine-tuning. This leaves a foundational question unanswered: does VLA pretraining itself yield executable robot behavior, or does it merely furnish a better initialization for downstream policy learning? We present Wall-OSS-0.5, an open-source 4B VLA built upon a 3B VLM backbone augmented with action-generation components, designed so that pretrained robotic capability is directly measurable on physical hardware. The model is pretrained across more than 20 embodiments, processing over one million robot trajectories per epoch alongside a grounded multimodal corpus. We adopt a gradient-bridged co-training recipe in which three objectives play distinct and complementary roles: discrete action prediction routes strong VLM-native gradients into the backbone, multimodal prediction preserves grounded vision-language understanding, and continuous flow matching serves as the deployment-time action interface. Before task-specific fine-tuning, the pretrained checkpoint achieves non-trivial zero-shot real-robot behavior, completing several tasks, including a held-out deformable manipulation task, at high task progress on a 17-task suite. After fine-tuning, the same checkpoint serves as a stronger adaptation prior, reaching 60.5% average task progress on 15 real-robot tasks and outperforming π_0.5 by 17.5%. Multimodal evaluations further confirm that action training does not erode grounded vision-language competence: the model preserves broad vision-language ability while strengthening embodied grounding. Together, these results reposition VLA pretraining from an initialization strategy to a directly testable, already useful source of robot capability.
format Preprint
id arxiv_https___arxiv_org_abs_2605_30877
institution arXiv
publishDate 2026
record_format arxiv
title Wall-OSS-0.5 Technical Report
topic Robotics
url https://arxiv.org/abs/2605.30877