Bibliographic Details
Main Authors: Gao, Yupeng, Li, Tianyu, Wang, Guoqing, Yang, Yang
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2605.04409
author Gao, Yupeng
Li, Tianyu
Wang, Guoqing
Yang, Yang
contents Remote Sensing Image Change Captioning (RSICC) aims to generate spatially grounded natural language descriptions of scene evolution from bi-temporal imagery, moving beyond binary change masks toward semantic-level understanding. However, existing methods rely on implicit feature differencing without explicitly modeling structured change semantics, and struggle to reconcile the conflicting representation demands of change detection and caption generation. In addition, current benchmarks provide limited coverage of high-resolution urban construction scenarios. To address these challenges, we propose PTNet, a prototype-guided task-adaptive framework for joint change captioning and detection. PTNet explicitly models structured change semantics through a learnable prototype bank that guides cross-temporal interaction, disentangles task-specific representations via multi-head gating, and injects detection-derived spatial priors into caption generation, enabling coherent semantic correspondence while preserving fine-grained spatial sensitivity. Furthermore, we construct UCCD, a large-scale UAV-based benchmark comprising 9,000 high-resolution image pairs and 45,000 annotated sentences for urban construction monitoring. Extensive experiments on UCCD and WHU-CDC demonstrate that PTNet consistently outperforms existing methods. The dataset and source code are publicly available at https://github.com/G124556/ptnet.
format Preprint
id arxiv_https___arxiv_org_abs_2605_04409
institution arXiv
publishDate 2026
record_format arxiv
title UAV as Urban Construction Change Monitor: A New Benchmark and Change Captioning Model
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2605.04409