| Main Author: | Riggi, S. |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Instrumentation and Methods for Astrophysics |
| Online Access: | https://arxiv.org/abs/2602.07469 |
| _version_ | 1866914313894100992 |
|---|---|
| author | Riggi, S. |
| author_facet | Riggi, S. |
| contents | Vision-language models (VLMs) have recently shown promise in general-purpose reasoning tasks, yet their applicability to domain-specific scientific workflows remains largely unexplored. In this work, we evaluated a series of open-weight and commercial VLMs on six tasks relevant to radio astronomy, such as source morphology classification. We also introduced radio-llava, a fine-tuned multimodal assistant built on the LLaVA architecture and adapted for the radio domain through instruction fine-tuning. In zero-shot mode, commercial models like GPT-4.1 outperform open-weight VLMs on most radio benchmarks. However, radio-llava significantly improves upon both base LLaVA and commercial models across nearly all tasks. Despite these gains, specialized vision-only models still deliver substantially better performance across the board. Additionally, we observed that fine-tuning introduces catastrophic forgetting on general multimodal tasks, with performance drops up to 40% that can be partly mitigated with a more diverse training dataset or shallow fine-tuning. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_07469 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Toward Vision-Language Assistants for Radio Astronomical Source Analysis |
| topic | Instrumentation and Methods for Astrophysics |
| url | https://arxiv.org/abs/2602.07469 |