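Staff View: SONAR: Semantic-Object Navigation with Aggregated Reasoning through a Cross-Modal Inference Paradigm :: Library Catalog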

Bibliographic Details
Main Authors:	Wang, Yao; Sun, Zhirui; Chi, Wenzheng; Jia, Baozhi; Xu, Wenjun; Wang, Jiankun
Format:	Preprint
Published:	2025
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2509.24321

_version_	1866908565146435584
author	Wang, Yao; Sun, Zhirui; Chi, Wenzheng; Jia, Baozhi; Xu, Wenjun; Wang, Jiankun
author_facet	Wang, Yao; Sun, Zhirui; Chi, Wenzheng; Jia, Baozhi; Xu, Wenjun; Wang, Jiankun
contents	Understanding human instructions and accomplishing Vision-Language Navigation tasks in unknown environments is essential for robots. However, existing modular approaches rely heavily on the quality of training data and often generalize poorly. Vision-Language-Model-based methods, while demonstrating strong generalization capabilities, tend to perform unsatisfactorily when semantic cues are weak. To address these issues, this paper proposes SONAR, an aggregated reasoning approach through a cross-modal paradigm. The proposed method integrates a semantic-map-based target prediction module with a Vision-Language-Model-based value map module, enabling more robust navigation in unknown environments with varying levels of semantic cues and balancing generalization ability with scene adaptability. For target localization, we propose a strategy that integrates multi-scale semantic maps with confidence maps to mitigate false detections of target objects. We evaluated SONAR in the Gazebo simulator, using the challenging Matterport 3D (MP3D) dataset as the experimental benchmark. Experimental results show that SONAR achieves a success rate of 38.4% and an SPL of 17.7%.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_24321
institution	arXiv
publishDate	2025
record_format	arxiv
title	SONAR: Semantic-Object Navigation with Aggregated Reasoning through a Cross-Modal Inference Paradigm
topic	Robotics
url	https://arxiv.org/abs/2509.24321

Similar Items