Bibliographic Details
Main Authors: Xia, Xingyu, Zhou, Lekai, Tang, Yujie, Zhu, Xiaozhou, Zhu, Hai, Yao, Wen
Format: Preprint
Published: 2026
Subjects: Robotics
Online Access: https://arxiv.org/abs/2604.07705
_version_ 1866914461100539904
author Xia, Xingyu
Zhou, Lekai
Tang, Yujie
Zhu, Xiaozhou
Zhu, Hai
Yao, Wen
contents Aerial vision-and-language navigation (Aerial VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and autonomously navigate complex three-dimensional environments by grounding language in visual perception. This survey provides a critical and analytical review of the Aerial VLN field, with particular attention to the recent integration of large language models (LLMs) and vision-language models (VLMs). We first formally introduce the Aerial VLN problem and define two interaction paradigms, single-instruction and dialog-based, as foundational axes. We then organize the body of Aerial VLN methods into a taxonomy of five architectural categories: sequence-to-sequence and attention-based methods, end-to-end LLM/VLM methods, hierarchical methods, multi-agent methods, and dialog-based navigation methods. For each category, we systematically analyze design rationales, technical trade-offs, and reported performance. We critically assess the evaluation infrastructure for Aerial VLN, including datasets, simulation platforms, and metrics, and identify their gaps in scale, environmental diversity, real-world grounding, and metric coverage. We consolidate cross-method comparisons on shared benchmarks and analyze key architectural trade-offs, including discrete versus continuous actions, end-to-end versus hierarchical designs, and the simulation-to-reality gap. Finally, we synthesize seven concrete open problems (long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution, onboard deployment, benchmark standardization, and multi-UAV swarm navigation), with specific research directions grounded in the evidence presented throughout the survey.
format Preprint
id arxiv_https___arxiv_org_abs_2604_07705
institution arXiv
publishDate 2026
record_format arxiv
title Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models
topic Robotics
url https://arxiv.org/abs/2604.07705