Staff View: FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Haochen; Savaliya, Nirav; Siddiqui, Faizan; Sachdeva, Enna
Format:	Preprint
Published:	2026
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2602.15813

_version_	1866915802608828416
author	Zhang, Haochen; Savaliya, Nirav; Siddiqui, Faizan; Sachdeva, Enna
author_facet	Zhang, Haochen; Savaliya, Nirav; Siddiqui, Faizan; Sachdeva, Enna
contents	Embodied Question Answering (EQA) combines visual scene understanding, goal-directed exploration, and spatial and temporal reasoning under partial observability. A central challenge is to confine physical search to question-relevant subspaces while maintaining a compact, actionable memory of observations. Furthermore, for real-world deployment, fast inference time during exploration is crucial. We introduce FAST-EQA, a question-conditioned framework that (i) identifies likely visual targets, (ii) scores global regions of interest to guide navigation, and (iii) employs Chain-of-Thought (CoT) reasoning over visual memory to answer confidently. FAST-EQA maintains a bounded scene memory that stores a fixed-capacity set of region-target hypotheses and updates them online, enabling robust handling of both single and multi-target questions without unbounded growth. To expand coverage efficiently, a global exploration policy treats narrow openings and doors as high-value frontiers, complementing local target seeking with minimal computation. Together, these components focus the agent's attention, improve scene coverage, and improve answer reliability while running substantially faster than prior approaches. On HMEQA and EXPRESS-Bench, FAST-EQA achieves state-of-the-art performance, while performing competitively on OpenEQA and MT-HM3D.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_15813
institution	arXiv
publishDate	2026
record_format	arxiv
title	FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy
topic	Robotics
url	https://arxiv.org/abs/2602.15813

Similar Items