Staff View: PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

Bibliographic Details
Main Authors:	Ke, Shuyan; Mei, Yifan; Wu, Changli; Zheng, Yonghan; Ji, Jiayi; Cao, Liujuan; Ji, Rongrong
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.15670

_version_	1866914482303795200
author	Ke, Shuyan; Mei, Yifan; Wu, Changli; Zheng, Yonghan; Ji, Jiayi; Cao, Liujuan; Ji, Rongrong
author_facet	Ke, Shuyan; Mei, Yifan; Wu, Changli; Zheng, Yonghan; Ji, Jiayi; Cao, Liujuan; Ji, Rongrong
contents	Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_15670
institution	arXiv
publishDate	2026
record_format	arxiv
title	PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.15670
