Bibliographic Details
Main Authors: Hu, Yu; Gu, Jianyang; Liu, Hao; Cao, Yue; Hamari, Jozsef; Liu, Zheng; Zardadi, Mohsen
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2603.12659
_version_ 1866908883600015360
author Hu, Yu
Gu, Jianyang
Liu, Hao
Cao, Yue
Hamari, Jozsef
Liu, Zheng
Zardadi, Mohsen
contents Adapting vision-language models to remote sensing imagery remains challenging for two reasons: textual representations offer limited semantic coverage, and visual features are insufficiently adaptable. Both problems are pronounced in aerial scenes, which exhibit diverse visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored to adapting vision-language models to remote sensing. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying their validity against remote sensing image features. The student module integrates lightweight, learnable prompts into both the vision and language encoders and is guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also improves mean recall for cross-modal retrieval while adding few trainable parameters.
format Preprint
id arxiv_https___arxiv_org_abs_2603_12659
institution arXiv
publishDate 2026
record_format arxiv
title AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.12659