Bibliographic Details
Main Authors: Huang, Chi-Pin; Man, Yunze; Yu, Zhiding; Chen, Min-Hung; Kautz, Jan; Wang, Yu-Chiang Frank; Yang, Fu-En
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2601.09708
_version_ 1866912922553286656
author Huang, Chi-Pin
Man, Yunze
Yu, Zhiding
Chen, Min-Hung
Kautz, Jan
Wang, Yu-Chiang Frank
Yang, Fu-En
contents Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
format Preprint
id arxiv_https___arxiv_org_abs_2601_09708
institution arXiv
publishDate 2026
record_format arxiv
title Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
topic Computer Vision and Pattern Recognition
Artificial Intelligence
Machine Learning
Robotics
url https://arxiv.org/abs/2601.09708