Saved in:
| Main Authors: | , |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.10085 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866918360537628672 |
|---|---|
| author | Ziakas, Christos Russo, Alessandra |
| author_facet | Ziakas, Christos Russo, Alessandra |
| contents | Vision-Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art zero-shot method using autoregressive VLMs. Furthermore, we demonstrate that VITA's zero-shot value estimates can be utilized for reward shaping in offline reinforcement learning, resulting in multi-task policies on the Meta-World benchmark that exceed the performance of those trained with the simulation's fuzzy-logic dense rewards. Project website: https://chziakas.github.io/vita/. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2506_10085 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| spellingShingle | VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models Ziakas, Christos Russo, Alessandra Computer Vision and Pattern Recognition Artificial Intelligence I.2.6; I.2.9; I.2.10 Vision-Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art zero-shot method using autoregressive VLMs. Furthermore, we demonstrate that VITA's zero-shot value estimates can be utilized for reward shaping in offline reinforcement learning, resulting in multi-task policies on the Meta-World benchmark that exceed the performance of those trained with the simulation's fuzzy-logic dense rewards. Project website: https://chziakas.github.io/vita/. |
| title | VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models |
| topic | Computer Vision and Pattern Recognition Artificial Intelligence I.2.6; I.2.9; I.2.10 |
| url | https://arxiv.org/abs/2506.10085 |