Bibliographic Details
Main Authors: Wang, Zeyu; Liu, Chang; Tjitrahardja, Eduardus; Wang, Yuntao; Pavlov, Borislav; Gou, Fangfei; Davila, Jose Manuel; Shi, Dai; Xu, Ran; Pan, Yue; Tan, Jiayi; Chang, Shuting; Wang, Qi; Li, Jinzhao; Hua, Jiacheng; Huang, Yifei; Sun, Jingwei; Zhang, Yu; Zhang, Liuxin; Yao, Guocai; Jia, Jia; Li, Yin; Wang, Qianying; Shi, Yuanchun; Liu, Miao
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2605.17262
author Wang, Zeyu
Liu, Chang
Tjitrahardja, Eduardus
Wang, Yuntao
Pavlov, Borislav
Gou, Fangfei
Davila, Jose Manuel
Shi, Dai
Xu, Ran
Pan, Yue
Tan, Jiayi
Chang, Shuting
Wang, Qi
Li, Jinzhao
Hua, Jiacheng
Huang, Yifei
Sun, Jingwei
Zhang, Yu
Zhang, Liuxin
Yao, Guocai
Jia, Jia
Li, Yin
Wang, Qianying
Shi, Yuanchun
Liu, Miao
contents Despite extensive efforts on egocentric video datasets and benchmarks, understanding users' internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users' interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users' internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users' subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/
format Preprint
id arxiv_https___arxiv_org_abs_2605_17262
institution arXiv
publishDate 2026
record_format arxiv
title EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2605.17262