Bibliographic Details
Main Authors: Zhang, Song; Chen, Yanlong; Li, Yilin; Chen, Yining; Yi, Zili; Zhang, Xiaowei; Li, Yawei
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2605.07562
_version_ 1866913102552891392
author Zhang, Song
Chen, Yanlong
Li, Yilin
Chen, Yining
Yi, Zili
Zhang, Xiaowei
Li, Yawei
contents Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs) spanning multiple orders of magnitude. Yet existing RS-VLMs often discard GSD or inject it as a discrete text token, forcing a single static parameter set to absorb the entire scale spectrum. We introduce ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable governing the model's computation path. At its core, CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA) modulates the LoRA low-rank subspace through a GSD-driven gate, enabling the model to dynamically route computation by physical scale. To remove reliance on sensor metadata at deployment, we pair CS-HLoRA with SSE-U, a lightweight heteroscedastic sub-head that predicts GSD and its uncertainty from visual features. To provide matching supervision, we construct GeoScale-VQA, a 1.5M-sample scale-layered RS-VQA corpus whose question-answer generation is conditioned on the same physical scalar that drives CS-HLoRA, forming a closed method-data loop. Trained with QLoRA on an 8B backbone, ScaleEarth achieves state-of-the-art results on remote-sensing benchmarks covering diverse Earth-system tasks, including XLRS-Bench and OmniEarth-Bench.
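The abstract describes two mechanisms without giving formulas: a GSD-driven gate over the LoRA low-rank subspace (CS-HLoRA) and a heteroscedastic head that predicts GSD with an uncertainty (SSE-U). The following is a minimal illustrative sketch of those two ideas, not the authors' implementation. Everything specific in it is an assumption: the softmax-over-log10(GSD) gate, the names `gsd_gate`, `gated_lora_forward`, and `gaussian_nll`, the `centers` and `width` parameters, and the choice of a Gaussian negative log-likelihood on log-GSD.

```python
import math

def gsd_gate(gsd_m, centers, width=1.0):
    """Hypothetical gate: one weight per LoRA rank component, computed from
    log10(GSD in metres). Each component prefers a scale centre; a softmax
    over negative squared distances keeps the weights positive, summing to 1."""
    s = math.log10(gsd_m)
    logits = [-((s - c) / width) ** 2 for c in centers]
    top = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def gated_lora_forward(x, W, A, B, gate, alpha=1.0):
    """y = W x + (alpha / r) * B (gate * (A x)), with the gate applied
    elementwise in the rank-r subspace. W is the frozen base weight;
    A (r x d_in) and B (d_out x r) are the low-rank LoRA factors."""
    r = len(A)
    h = matvec(A, x)                               # project into rank-r subspace
    h = [g * hi for g, hi in zip(gate, h)]         # scale-conditioned gating
    delta = matvec(B, h)                           # project back to output space
    base = matvec(W, x)
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def gaussian_nll(log_gsd_true, mu, log_var):
    """Heteroscedastic Gaussian negative log-likelihood (constant term dropped).
    The head predicts a mean `mu` and a log-variance `log_var` for log-GSD;
    a large predicted variance down-weights the squared error but is penalised."""
    return 0.5 * (log_var + (log_gsd_true - mu) ** 2 / math.exp(log_var))
```

Under these assumptions, an image at 1 m GSD with centres at 0.1 m, 1 m, and 10 m (`centers=[-1.0, 0.0, 1.0]`) puts the largest gate weight on the middle rank component, and an all-zero gate reduces the layer to the frozen base projection `W x`.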
format Preprint
id arxiv_https___arxiv_org_abs_2605_07562
institution arXiv
publishDate 2026
record_format arxiv
title Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2605.07562