Staff View: Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Song; Chen, Yanlong; Li, Yilin; Chen, Yining; Yi, Zili; Zhang, Xiaowei; Li, Yawei
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.07562

_version_	1866913102552891392
author	Zhang, Song; Chen, Yanlong; Li, Yilin; Chen, Yining; Yi, Zili; Zhang, Xiaowei; Li, Yawei
author_facet	Zhang, Song; Chen, Yanlong; Li, Yilin; Chen, Yining; Yi, Zili; Zhang, Xiaowei; Li, Yawei
contents	Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs) spanning multiple orders of magnitude. Yet existing RS-VLMs often discard GSD or inject it as a discrete text token, forcing a single static parameter set to absorb the entire scale spectrum. We introduce ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable governing the model's computation path. At its core, CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA) modulates the LoRA low-rank subspace through a GSD-driven gate, enabling the model to dynamically route computation by physical scale. To remove reliance on sensor metadata at deployment, we pair CS-HLoRA with SSE-U, a lightweight heteroscedastic sub-head that predicts GSD and its uncertainty from visual features. To provide matching supervision, we construct GeoScale-VQA, a 1.5M-sample scale-layered RS-VQA corpus whose question-answer generation is conditioned on the same physical scalar that drives CS-HLoRA, forming a closed method-data loop. Trained with QLoRA on an 8B backbone, ScaleEarth achieves state-of-the-art results on remote-sensing benchmarks covering diverse Earth-system tasks, including XLRS-Bench and OmniEarth-Bench.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_07562
institution	arXiv
publishDate	2026
record_format	arxiv
title	Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2605.07562
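
Illustrative note: the abstract describes a GSD-driven gate that modulates the LoRA low-rank subspace but does not give its exact form. The sketch below is one possible reading, not the authors' implementation. It assumes the gate is a small MLP applied to log10(GSD), produces one sigmoid value per LoRA rank, and multiplies the rank-r activations before the up-projection. All function names, shapes, and weights here are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    # Plain matrix-vector product on nested lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def scale_gate(gsd_m, w1, b1, w2, b2):
    """Map a GSD (metres per pixel) to one gate value in (0, 1) per LoRA rank.

    Assumption: the GSD enters on a log10 scale, since it spans several
    orders of magnitude, and passes through a one-hidden-layer MLP.
    """
    s = math.log10(gsd_m)
    hidden = [math.tanh(w * s + b) for w, b in zip(w1, b1)]
    return [sigmoid(z + b) for z, b in zip(matvec(w2, hidden), b2)]

def gated_lora_forward(x, W, A, B, gate, alpha=1.0):
    """Compute W x + alpha * B (gate * (A x)).

    W is the frozen base weight, A (r x d_in) and B (d_out x r) are the
    LoRA factors, and gate holds r values. With gate all ones this
    reduces to an ordinary LoRA update.
    """
    base = matvec(W, x)
    low = [g * h for g, h in zip(gate, matvec(A, x))]  # gated rank-r code
    delta = matvec(B, low)
    return [b + alpha * d for b, d in zip(base, delta)]

if __name__ == "__main__":
    # Toy weights chosen by hand, for illustration only.
    W = [[1, 0], [0, 1]]
    A = [[1, 1]]
    B = [[1], [2]]
    x = [1, 2]
    for gsd in (0.3, 30.0):  # sub-metre aerial vs. 30 m satellite imagery
        gate = scale_gate(gsd, [1.0], [0.0], [[2.0]], [0.0])
        print(gsd, gate, gated_lora_forward(x, W, A, B, gate))
```

In this reading the same input produces a different low-rank update at each physical scale, while the base weight W stays fixed. A predicted GSD (as from the SSE-U head mentioned in the abstract) could be passed in place of sensor metadata.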

Similar Items