Bibliographic Details
Main Authors: Huang, Zeyi; Ji, Yuyang; Wang, Xiaofang; Mehta, Nikhil; Xiao, Tong; Lee, Donghyun; Vanvalkenburgh, Sigmund; Zha, Shengxin; Lai, Bolin; Ren, Yiqiu; Yu, Licheng; Zhang, Ning; Lee, Yong Jae; Liu, Miao
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2501.04336
contents Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB) to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.
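The three layers named in the abstract above (hand-object interactions, activity zones, environment layout) can be pictured as a small nested graph that is flattened into text for an LLM. The following is only an illustrative sketch and not the authors' implementation: every class, field, and function name (`Interaction`, `ActivityZone`, `Room`, `describe`) and all sample data are hypothetical, and the abstract does not specify the actual graph schema or prompt format.

```python
from dataclasses import dataclass, field


@dataclass
class Interaction:
    """Hypothetical leaf node: one hand-object interaction with a time span."""
    obj: str
    action: str
    start_s: float
    end_s: float


@dataclass
class ActivityZone:
    """Hypothetical middle layer: an area where activity recurs."""
    name: str
    interactions: list[Interaction] = field(default_factory=list)


@dataclass
class Room:
    """Hypothetical top layer: a layout node linked to neighbouring rooms."""
    name: str
    zones: list[ActivityZone] = field(default_factory=list)
    connected_to: list[str] = field(default_factory=list)


def describe(rooms: list[Room]) -> str:
    """Flatten the layered graph into plain text that an LLM could parse."""
    lines = []
    for room in rooms:
        links = ", ".join(room.connected_to) or "no other rooms"
        lines.append(f"Room '{room.name}' connects to {links}.")
        for zone in room.zones:
            lines.append(f"  Zone '{zone.name}':")
            # List interactions in temporal order so sequence is explicit.
            for it in sorted(zone.interactions, key=lambda i: i.start_s):
                lines.append(
                    f"    {it.start_s:.0f}-{it.end_s:.0f}s: hand {it.action} {it.obj}"
                )
    return "\n".join(lines)


# Made-up sample data, purely for illustration.
kitchen = Room(
    name="kitchen",
    connected_to=["hallway"],
    zones=[
        ActivityZone(
            name="counter",
            interactions=[
                Interaction("knife", "picks up", 12.0, 15.0),
                Interaction("cup", "puts down", 3.0, 5.0),
            ],
        )
    ],
)
prompt_context = describe([kitchen])
print(prompt_context)
```

The resulting `prompt_context` string would be prepended to a question before querying a language model. Sorting interactions by start time inside each zone is one simple way to keep spatial grouping and temporal order visible in the same text.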
format Preprint
id arxiv_https___arxiv_org_abs_2501_04336
institution arXiv
publishDate 2025
record_format arxiv
title Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2501.04336