Staff View: Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Fei; Chen, Chengcheng; Chen, Hongyu; Chang, Yugang; Zeng, Weiming
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.08144

_version_	1866915206154682368
author	Wang, Fei; Chen, Chengcheng; Chen, Hongyu; Chang, Yugang; Zeng, Weiming
author_facet	Wang, Fei; Chen, Chengcheng; Chen, Hongyu; Chang, Yugang; Zeng, Weiming
contents	Recently, large language models (LLMs) and vision-language models (VLMs) have achieved significant success, demonstrating remarkable capabilities in understanding various images and videos, particularly in classification and detection tasks. However, due to the substantial differences between remote sensing images and conventional optical images, these models face considerable challenges in comprehension, especially in detection tasks. Directly prompting VLMs with detection instructions often leads to unsatisfactory results. To address this issue, this letter explores the application of VLMs for object detection in remote sensing images. Specifically, we construct supervised fine-tuning (SFT) datasets using publicly available remote sensing object detection datasets, including SSDD, HRSID, and NWPU-VHR-10. In these new datasets, we convert annotation information into JSON-compliant natural language descriptions, facilitating more effective understanding and training for the VLM. We then evaluate the detection performance of various fine-tuning strategies for VLMs and derive optimized model weights for object detection in remote sensing images. Finally, we evaluate the model's prior knowledge capabilities using natural language queries. Experimental results demonstrate that, without modifying the model architecture, remote sensing object detection can be effectively achieved using natural language alone. Additionally, the model exhibits the ability to perform certain visual question answering (VQA) tasks. Our datasets and related code will be released soon.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_08144
institution	arXiv
publishDate	2025
record_format	arxiv
title	Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.08144

Similar Items