| Main authors: | Huang, Chi-Pin; Man, Yunze; Yu, Zhiding; Chen, Min-Hung; Kautz, Jan; Wang, Yu-Chiang Frank; Yang, Fu-En |
|---|---|
| Type: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning; Robotics |
| Online access: | https://arxiv.org/abs/2601.09708 |
| _version_ | 1866912922553286656 |
|---|---|
| author | Huang, Chi-Pin; Man, Yunze; Yu, Zhiding; Chen, Min-Hung; Kautz, Jan; Wang, Yu-Chiang Frank; Yang, Fu-En |
| author_facet | Huang, Chi-Pin; Man, Yunze; Yu, Zhiding; Chen, Min-Hung; Kautz, Jan; Wang, Yu-Chiang Frank; Yang, Fu-En |
| contents | Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2601_09708 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning |
| topic | Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning; Robotics |
| url | https://arxiv.org/abs/2601.09708 |