Bibliographic Details
Main Authors: Wang, Guodong, Zhang, Chenkai, Liu, Qingjie, Zhang, Jinjin, Cai, Jiancheng, Liu, Junjie, Liu, Xinmin
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2602.06556
author Wang, Guodong
Zhang, Chenkai
Liu, Qingjie
Zhang, Jinjin
Cai, Jiancheng
Liu, Junjie
Liu, Xinmin
contents Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and alignment of perception with language-driven manipulation tasks. However, existing benchmarks often provide limited or misleading assessments due to insufficient evaluation protocols that inadequately capture real-world distribution shifts. This work systematically rethinks VLA benchmarking from both evaluation and data perspectives, introducing LIBERO-X, a benchmark featuring: 1) A hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities: spatial generalization, object recognition, and task instruction understanding. This design enables fine-grained analysis of performance degradation under increasing environmental and task complexity; 2) A high-diversity training dataset collected via human teleoperation, where each scene supports multiple fine-grained manipulation objectives to bridge the train-evaluation distribution gap. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding. By integrating hierarchical evaluation with diverse training data, LIBERO-X offers a more reliable foundation for assessing and advancing VLA development.
format Preprint
id arxiv_https___arxiv_org_abs_2602_06556
institution arXiv
publishDate 2026
record_format arxiv
title LIBERO-X: Robustness Litmus for Vision-Language-Action Models
topic Computer Vision and Pattern Recognition
Artificial Intelligence
Robotics
url https://arxiv.org/abs/2602.06556