Bibliographic Details
Main Authors: Dong, Shaoqi; Fu, Chaoyou; Gao, Haihan; Zhang, Yi-Fan; Yan, Chi; Wu, Chu; Liu, Xiaoyu; Shen, Yunhang; Huo, Jing; Jiang, Deqiang; Cao, Haoyu; Gao, Yang; Sun, Xing; He, Ran; Shan, Caifeng
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2510.09607
_version_ 1866918162343133184
author Dong, Shaoqi
Fu, Chaoyou
Gao, Haihan
Zhang, Yi-Fan
Yan, Chi
Wu, Chu
Liu, Xiaoyu
Shen, Yunhang
Huo, Jing
Jiang, Deqiang
Cao, Haoyu
Gao, Yang
Sun, Xing
He, Ran
Shan, Caifeng
contents Vision-Language-Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, which enables effective reuse of its pretrained action decoder and avoids expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules so that the system can integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves a 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving an 82.0% success rate (17% improvement). These results demonstrate that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.
format Preprint
id arxiv_https___arxiv_org_abs_2510_09607
institution arXiv
publishDate 2025
record_format arxiv
title VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2510.09607
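
The first training stage described in the abstract (mapping VLM hidden states into the action space of the small action model so that its pretrained decoder can be reused) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the linear projector, the mean-squared-error alignment loss, the plain SGD loop, the synthetic data, the tiny dimensions, and every function and variable name. It only shows the general idea of fitting a small mapping against a frozen teacher.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def mean_alignment_loss(projector, hidden_states, teacher_embeddings):
    """Mean squared error between projected hidden states and teacher embeddings."""
    total = 0.0
    for h, z in zip(hidden_states, teacher_embeddings):
        pred = matvec(projector, h)
        total += sum((p - t) ** 2 for p, t in zip(pred, z)) / len(z)
    return total / len(hidden_states)


def align_projector(projector, hidden_states, teacher_embeddings, lr=0.05, epochs=200):
    """Fit the projector in place with plain SGD on the MSE alignment loss."""
    d_out = len(projector)
    for _ in range(epochs):
        for h, z in zip(hidden_states, teacher_embeddings):
            pred = matvec(projector, h)
            for i in range(d_out):
                scale = 2.0 * (pred[i] - z[i]) / d_out  # d(MSE)/d(pred_i)
                for j, h_j in enumerate(h):
                    projector[i][j] -= lr * scale * h_j
    return projector


def run_demo(seed=0, d_hidden=4, d_action_emb=3, d_action=2, n_samples=32):
    """Toy run on synthetic data; returns (initial_loss, final_loss, action_gap)."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols, scale):
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

    # Hypothetical stand-ins: "VLM hidden states" at the action token and the
    # teacher's action embeddings, generated here from a hidden linear map.
    hidden_states = rand_matrix(n_samples, d_hidden, 1.0)
    true_map = rand_matrix(d_action_emb, d_hidden, 1.0)
    teacher_embeddings = [matvec(true_map, h) for h in hidden_states]

    # Hypothetical frozen teacher decoder (never updated below).
    frozen_decoder = rand_matrix(d_action, d_action_emb, 1.0)

    # Only the projector is trained (assumption: a single linear layer).
    projector = rand_matrix(d_action_emb, d_hidden, 0.1)
    initial_loss = mean_alignment_loss(projector, hidden_states, teacher_embeddings)
    align_projector(projector, hidden_states, teacher_embeddings)
    final_loss = mean_alignment_loss(projector, hidden_states, teacher_embeddings)

    # Largest difference between actions decoded from the student path
    # (projector + frozen decoder) and from the teacher embeddings.
    action_gap = 0.0
    for h, z in zip(hidden_states, teacher_embeddings):
        student_action = matvec(frozen_decoder, matvec(projector, h))
        teacher_action = matvec(frozen_decoder, z)
        for s, t in zip(student_action, teacher_action):
            action_gap = max(action_gap, abs(s - t))
    return initial_loss, final_loss, action_gap


if __name__ == "__main__":
    before, after, gap = run_demo()
    print(f"alignment loss before: {before:.4f}, after: {after:.6f}, max action gap: {gap:.6f}")
```

In this sketch the decoder is never touched, which mirrors the reuse idea in the abstract: once projected hidden states match the teacher's embeddings, the frozen decoder produces nearly the same actions from either input. The second stage (selective fine-tuning of the language model, state encoder, and action modules) is not modeled here.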