Staff View: X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning :: Library Catalog

Bibliographic Details
Main Authors:	Shao, Maanping; Zhang, Feihong; Zhang, Gu; Cheng, Baiye; Xue, Zhengrong; Xu, Huazhe
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.11269

_version_	1866918292524892160
author	Shao, Maanping; Zhang, Feihong; Zhang, Gu; Cheng, Baiye; Xue, Zhengrong; Xu, Huazhe
author_facet	Shao, Maanping; Zhang, Feihong; Zhang, Gu; Cheng, Baiye; Xue, Zhengrong; Xu, Huazhe
contents	Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill also surpasses 3D encoders that utilize privileged point cloud observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_11269
institution	arXiv
publishDate	2026
record_format	arxiv
title	X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2601.11269
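
Note on the contents field: the abstract describes training a compact student encoder to reproduce the representations of a frozen teacher. The record does not state the loss function or training details, so the following is only a toy sketch of that general idea, assuming a mean-squared-error feature-matching objective. A fixed linear map stands in for the teacher and a small linear map for the student; every name and hyperparameter here (`distill`, `teacher_W`, `lr`, `epochs`) is hypothetical and not taken from the paper.

```python
import random

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(a, b):
    # Mean squared error between two equal-length vectors.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def avg_loss(student_W, teacher_W, inputs):
    # Average feature-matching loss over the dataset.
    total = sum(mse(matvec(student_W, x), matvec(teacher_W, x)) for x in inputs)
    return total / len(inputs)

def distill(student_W, teacher_W, inputs, lr=0.05, epochs=200):
    # Plain SGD on the MSE between student and teacher features.
    # The teacher is frozen: only student_W is updated.
    dim_out = len(student_W)
    for _ in range(epochs):
        for x in inputs:
            target = matvec(teacher_W, x)
            pred = matvec(student_W, x)
            for i in range(dim_out):
                grad_out = 2.0 * (pred[i] - target[i]) / dim_out
                for j in range(len(x)):
                    student_W[i][j] -= lr * grad_out * x[j]
    return student_W

rng = random.Random(0)
dim_in, dim_out = 4, 3
# Stand-in "teacher": a fixed random linear map (assumption for illustration).
teacher_W = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_out)]
# Stand-in "student": a small randomly initialised linear map.
student_W = [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)] for _ in range(dim_out)]
# Stand-in "dataset": random input vectors.
inputs = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(32)]

initial = avg_loss(student_W, teacher_W, inputs)
distill(student_W, teacher_W, inputs)
final = avg_loss(student_W, teacher_W, inputs)
print(f"loss before: {initial:.4f}, after: {final:.2e}")
```

In the setting the abstract describes, the teacher would be a DINOv2 model and the student a ResNet-18, and the distilled student would then be fine-tuned together with a policy head; none of that is modelled here.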