Staff View: Wall-OSS-0.5 Technical Report :: Library Catalog

Bibliographic Details
Main Authors:	Yu, Ryan; Zhang, Pushi; Liu, Starrick; Liu, Brae; Kang, Miracle; Li, Shalfun; Shi, Lights; Ma, Ellie; Yang, Ping; Pan, Chris; Chen, Jerry; Liu, Dongxiu; Sun, Rain; Guo, Miles; Zhang, Byron; Zhou, Hugo; Xu, Zach; Chen, Vincent; Huang, Harrison; Wang, James; Kuzi, Dance; Zhai, Andy; Su, Hang; Gan, Roy; Liang, Lucy; Wang, Hao; Wang, Qian
Format:	Preprint
Published:	2026
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2605.30877

_version_	1866911738187743232
author	Yu, Ryan; Zhang, Pushi; Liu, Starrick; Liu, Brae; Kang, Miracle; Li, Shalfun; Shi, Lights; Ma, Ellie; Yang, Ping; Pan, Chris; Chen, Jerry; Liu, Dongxiu; Sun, Rain; Guo, Miles; Zhang, Byron; Zhou, Hugo; Xu, Zach; Chen, Vincent; Huang, Harrison; Wang, James; Kuzi, Dance; Zhai, Andy; Su, Hang; Gan, Roy; Liang, Lucy; Wang, Hao; Wang, Qian
author_facet	Yu, Ryan; Zhang, Pushi; Liu, Starrick; Liu, Brae; Kang, Miracle; Li, Shalfun; Shi, Lights; Ma, Ellie; Yang, Ping; Pan, Chris; Chen, Jerry; Liu, Dongxiu; Sun, Rain; Guo, Miles; Zhang, Byron; Zhou, Hugo; Xu, Zach; Chen, Vincent; Huang, Harrison; Wang, James; Kuzi, Dance; Zhai, Andy; Su, Hang; Gan, Roy; Liang, Lucy; Wang, Hao; Wang, Qian
contents	Large-scale Vision-Language-Action (VLA) pretraining is increasingly adopted as the foundation for robot policies, yet the evidence for pretrained VLAs is almost invariably reported after task-specific fine-tuning. This leaves a foundational question unanswered: does VLA pretraining itself yield executable robot behavior, or does it merely furnish a better initialization for downstream policy learning? We present Wall-OSS-0.5, an open-source 4B VLA built upon a 3B VLM backbone augmented with action-generation components, designed so that pretrained robotic capability is directly measurable on physical hardware. The model is pretrained across more than 20 embodiments, processing over one million robot trajectories per epoch alongside a grounded multimodal corpus. We adopt a gradient-bridged co-training recipe in which three objectives play distinct and complementary roles: discrete action prediction routes strong VLM-native gradients into the backbone, multimodal prediction preserves grounded vision-language understanding, and continuous flow matching serves as the deployment-time action interface. Before task-specific fine-tuning, the pretrained checkpoint achieves non-trivial zero-shot real-robot behavior, completing several tasks, including a held-out deformable manipulation task, at high task progress on a 17-task suite. After fine-tuning, the same checkpoint serves as a stronger adaptation prior, reaching 60.5% average task progress on 15 real-robot tasks and outperforming π_0.5 by 17.5%. Multimodal evaluations further confirm that action training does not erode grounded vision-language competence: the model preserves broad vision-language ability while strengthening embodied grounding. Together, these results reposition VLA pretraining from an initialization strategy to a directly testable, already useful source of robot capability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_30877
institution	arXiv
publishDate	2026
record_format	arxiv
title	Wall-OSS-0.5 Technical Report
topic	Robotics
url	https://arxiv.org/abs/2605.30877

Similar Items