Bibliographic Details
Main Authors: Shao, Maanping; Zhang, Feihong; Zhang, Gu; Cheng, Baiye; Xue, Zhengrong; Xu, Huazhe
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2601.11269
author Shao, Maanping
Zhang, Feihong
Zhang, Gu
Cheng, Baiye
Xue, Zhengrong
Xu, Huazhe
contents Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill also surpasses 3D encoders that utilize privileged point cloud observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
format Preprint
id arxiv_https___arxiv_org_abs_2601_11269
institution arXiv
publishDate 2026
record_format arxiv
title X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2601.11269