Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Zhao, Sizhe; Zhang, Shengping; Yang, Shuo; Zhao, Weiyu; Wang, Shuigen; Ji, Xiangyang
Format:	Preprint
Published:	2026
Subjects:	Robotics; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.25547

_version_	1866917531010203648
author	Zhao, Sizhe; Zhang, Shengping; Yang, Shuo; Zhao, Weiyu; Wang, Shuigen; Ji, Xiangyang
author_facet	Zhao, Sizhe; Zhang, Shengping; Yang, Shuo; Zhao, Weiyu; Wang, Shuigen; Ji, Xiangyang
contents	Existing embodied control research demonstrates remarkable performance improvements by scaling training data and model size. We instead explore inference-time strategy as an alternative axis. Non-deterministic generative models, such as diffusion and autoregressive models, have been widely adopted in the field of embodied control. However, the single-shot inference paradigm limits their performance. In this paper, we propose TapSampling, a plug-and-play framework for inference-time sampling. First, we introduce an Action-VAE that represents actions in a low-dimensional latent space by mapping policy-generated initial actions into a compressed posterior distribution, from which any number of latent samples can be drawn and decoded into candidate actions that approximate the true action distribution. Second, we formulate action verification as task-progress outcome prediction, using the intrinsic sequential structure of robotic datasets to train a semantically grounded verifier for interpretable action selection. Furthermore, TapSampling is a policy-agnostic framework. Extensive experiments in both simulated and real-world environments demonstrate that our method substantially improves multiple generalist policies without further policy finetuning. Code and models are available at the project page.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_25547
institution	arXiv
publishDate	2026
record_format	arxiv
title	TapSampling: Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation
topic	Robotics; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2605.25547
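
Illustrative note: the abstract in the contents field describes a sample-then-verify loop (encode the policy's initial action into a latent posterior, draw several latents, decode them into candidate actions, and let a verifier pick one). The Python sketch below shows only that general pattern. It is not the authors' implementation: encode, decode, and toy_progress_score are hypothetical stand-ins for the learned Action-VAE and the task-progress verifier, the 2-D latent and fixed standard deviation are arbitrary, and keeping the policy's own action in the candidate set is an assumption made here so that selection can never score worse than the single-shot action.

```python
import random
from typing import Callable, List, Sequence, Tuple

Action = List[float]


def encode(action: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Hypothetical encoder: compress an action into a 2-D Gaussian posterior.

    A learned Action-VAE would predict mean and std; this toy version
    averages each half of the action vector and uses a fixed std.
    """
    half = len(action) // 2
    mean = [sum(action[:half]) / half, sum(action[half:]) / (len(action) - half)]
    std = [0.1, 0.1]
    return mean, std


def decode(z: Sequence[float], dim: int) -> Action:
    """Hypothetical decoder: expand a 2-D latent back to a dim-D action."""
    half = dim // 2
    return [z[0]] * half + [z[1]] * (dim - half)


def sample_candidates(initial_action: Sequence[float], n: int,
                      rng: random.Random) -> List[Action]:
    """Draw n latents from the posterior and decode them into candidates."""
    mean, std = encode(initial_action)
    # Assumption: the policy's own action stays in the candidate pool.
    candidates = [list(initial_action)]
    for _ in range(n):
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
        candidates.append(decode(z, len(initial_action)))
    return candidates


def select_action(candidates: Sequence[Action],
                  verifier: Callable[[Action], float]) -> Action:
    """Return the candidate with the highest verifier score."""
    return max(candidates, key=verifier)


def toy_progress_score(action: Sequence[float], goal: Sequence[float]) -> float:
    """Stand-in verifier: higher when the action is closer to a known goal.

    A real verifier would predict task progress from observations.
    """
    return -sum((a - g) ** 2 for a, g in zip(action, goal))


# Small demonstration with made-up numbers.
rng = random.Random(0)
initial = [0.0, 0.0, 1.0, 1.0]
goal = [0.2, 0.2, 0.8, 0.8]
candidates = sample_candidates(initial, n=16, rng=rng)
best = select_action(candidates, lambda a: toy_progress_score(a, goal))
print("initial score:", toy_progress_score(initial, goal))
print("selected score:", toy_progress_score(best, goal))
```

Because the selection step only needs a list of candidates and a scoring function, the same loop can wrap any policy that emits an initial action, which is the sense in which the abstract calls the framework policy-agnostic.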

Similar Items