Bibliographic Details
Main Authors: Dai, Shiqi, Ma, Zizhi, Luo, Zhicong, Yang, Xuesong, Huang, Yibin, Zhang, Wanyue, Chen, Chi, Guo, Zonghao, Xu, Wang, Sun, Yufei, Sun, Maosong
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2512.23219
author Dai, Shiqi
Ma, Zizhi
Luo, Zhicong
Yang, Xuesong
Huang, Yibin
Zhang, Wanyue
Chen, Chi
Guo, Zonghao
Xu, Wang
Sun, Yufei
Sun, Maosong
contents While Multimodal Large Language Models (MLLMs) have exhibited remarkable general intelligence across diverse domains, their potential in low-altitude applications dominated by Unmanned Aerial Vehicles (UAVs) remains largely underexplored. Existing MLLM benchmarks rarely cover the unique challenges of low-altitude scenarios, while UAV-related evaluations mainly focus on specific tasks such as localization or navigation, without a unified evaluation of MLLMs' general intelligence. To bridge this gap, we present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions (perception, cognition, and planning) in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Extensive experiments on 16 open-source and proprietary MLLMs reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios. Our analyses further uncover critical bottlenecks such as spatial bias and multi-view understanding that hinder the effective deployment of MLLMs in UAV scenarios. We hope MM-UAVBench will foster future research on robust and reliable MLLMs for real-world UAV intelligence.
format Preprint
id arxiv_https___arxiv_org_abs_2512_23219
institution arXiv
publishDate 2025
record_format arxiv
title MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2512.23219