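Staff View: Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs :: Library Catalog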

Bibliographic Details
Main Authors:	Huang, Zeyi; Ji, Yuyang; Wang, Xiaofang; Mehta, Nikhil; Xiao, Tong; Lee, Donghyun; Vanvalkenburgh, Sigmund; Zha, Shengxin; Lai, Bolin; Ren, Yiqiu; Yu, Licheng; Zhang, Ning; Lee, Yong Jae; Liu, Miao
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2501.04336

_version_	1866917313057390592
author	Huang, Zeyi; Ji, Yuyang; Wang, Xiaofang; Mehta, Nikhil; Xiao, Tong; Lee, Donghyun; Vanvalkenburgh, Sigmund; Zha, Shengxin; Lai, Bolin; Ren, Yiqiu; Yu, Licheng; Zhang, Ning; Lee, Yong Jae; Liu, Miao
author_facet	Huang, Zeyi; Ji, Yuyang; Wang, Xiaofang; Mehta, Nikhil; Xiao, Tong; Lee, Donghyun; Vanvalkenburgh, Sigmund; Zha, Shengxin; Lai, Bolin; Ren, Yiqiu; Yu, Licheng; Zhang, Ning; Lee, Yong Jae; Liu, Miao
contents	Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB), to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_04336
institution	arXiv
publishDate	2025
record_format	arxiv
title	Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2501.04336

Similar Items