Staff View: EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Lulin; Li, Dayou; Liang, Yiqing; Jiang, Sicong; Vijay, Hitesh; Hu, Hezhen; Xu, Xuhai; Liu, Zirui; Shakkottai, Srinivas; Li, Manling; Fan, Zhiwen
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.09535

_version_	1866915930392494080
author	Liu, Lulin; Li, Dayou; Liang, Yiqing; Jiang, Sicong; Vijay, Hitesh; Hu, Hezhen; Xu, Xuhai; Liu, Zirui; Shakkottai, Srinivas; Li, Manling; Fan, Zhiwen
author_facet	Liu, Lulin; Li, Dayou; Liang, Yiqing; Jiang, Sicong; Vijay, Hitesh; Hu, Hezhen; Xu, Xuhai; Liu, Zirui; Shakkottai, Srinivas; Li, Manling; Fan, Zhiwen
contents	Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL, a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we benchmark VLMs and world models on six task dimensions from three layers, and on long-horizon generation over minute-long sequences, spanning more than 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_09535
institution	arXiv
publishDate	2026
record_format	arxiv
title	EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.09535

Similar Items