Staff View: Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision :: Library Catalog

Bibliographic Details
Main Authors:	Zhou, Wentao; Chen, Xuweiyi; Rajagopal, Vignesh; Chen, Jeffrey; Chandra, Rohan; Cheng, Zezhou
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2512.10956

_version_	1866911314491736064
author	Zhou, Wentao; Chen, Xuweiyi; Rajagopal, Vignesh; Chen, Jeffrey; Chandra, Rohan; Cheng, Zezhou
author_facet	Zhou, Wentao; Chen, Xuweiyi; Rajagopal, Vignesh; Chen, Jeffrey; Chandra, Rohan; Cheng, Zezhou
contents	The success of foundation models in language and vision has motivated research into fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass the state of the art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_10956
institution	arXiv
publishDate	2025
record_format	arxiv
title	Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2512.10956

Similar Items