Staff View: Toward Vision-Language Assistants for Radio Astronomical Source Analysis :: Library Catalog

Bibliographic Details
Main Author:	Riggi, S.
Format:	Preprint
Published:	2026
Subjects:	Instrumentation and Methods for Astrophysics
Online Access:	https://arxiv.org/abs/2602.07469

_version_	1866914313894100992
author	Riggi, S.
author_facet	Riggi, S.
contents	Vision-language models (VLMs) have recently shown promise in general-purpose reasoning tasks, yet their applicability to domain-specific scientific workflows remains largely unexplored. In this work, we evaluated a series of open-weight and commercial VLMs on six tasks relevant to radio astronomy, such as source morphology classification. We also introduced radio-llava, a fine-tuned multimodal assistant built on the LLaVA architecture and adapted for the radio domain through instruction fine-tuning. In zero-shot mode, commercial models like GPT-4.1 outperform open-weight VLMs on most radio benchmarks. However, radio-llava significantly improves upon both base LLaVA and commercial models across nearly all tasks. Despite these gains, specialized vision-only models still deliver substantially better performance across the board. Additionally, we observed that fine-tuning introduces catastrophic forgetting on general multimodal tasks, with performance drops up to 40% that can be partly mitigated with a more diverse training dataset or shallow fine-tuning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_07469
institution	arXiv
publishDate	2026
record_format	arxiv
title	Toward Vision-Language Assistants for Radio Astronomical Source Analysis
topic	Instrumentation and Methods for Astrophysics
url	https://arxiv.org/abs/2602.07469

Similar Items