Bibliographic Details
Main Authors: Liu, Lulin; Li, Dayou; Liang, Yiqing; Jiang, Sicong; Vijay, Hitesh; Hu, Hezhen; Xu, Xuhai; Liu, Zirui; Shakkottai, Srinivas; Li, Manling; Fan, Zhiwen
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2604.09535
author Liu, Lulin
Li, Dayou
Liang, Yiqing
Jiang, Sicong
Vijay, Hitesh
Hu, Hezhen
Xu, Xuhai
Liu, Zirui
Shakkottai, Srinivas
Li, Manling
Fan, Zhiwen
contents Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we benchmark VLMs and world models on six task dimensions from three layers and on long-horizon generation over minute-long sequences across more than 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we fine-tune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.
format Preprint
id arxiv_https___arxiv_org_abs_2604_09535
institution arXiv
publishDate 2026
record_format arxiv
title EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2604.09535