Bibliographic Details
Main Author: Riggi, S.
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.07469
author Riggi, S.
author_facet Riggi, S.
contents Vision-language models (VLMs) have recently shown promise in general-purpose reasoning tasks, yet their applicability to domain-specific scientific workflows remains largely unexplored. In this work, we evaluated a series of open-weight and commercial VLMs on six tasks relevant to radio astronomy, such as source morphology classification. We also introduced radio-llava, a fine-tuned multimodal assistant built on the LLaVA architecture and adapted for the radio domain through instruction fine-tuning. In zero-shot mode, commercial models like GPT-4.1 outperform open-weight VLMs on most radio benchmarks. However, radio-llava significantly improves upon both base LLaVA and commercial models across nearly all tasks. Despite these gains, specialized vision-only models still deliver substantially better performance across the board. Additionally, we observed that fine-tuning introduces catastrophic forgetting on general multimodal tasks, with performance drops of up to 40% that can be partly mitigated by a more diverse training dataset or by shallow fine-tuning.
format Preprint
id arxiv_https___arxiv_org_abs_2602_07469
institution arXiv
publishDate 2026
record_format arxiv
title Toward Vision-Language Assistants for Radio Astronomical Source Analysis
topic Instrumentation and Methods for Astrophysics
url https://arxiv.org/abs/2602.07469