MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Ziheng; Ma, Xinyue; Chowdhury, Arpita; Campolongo, Elizabeth G.; Thompson, Matthew J.; Zhang, Net; Stevens, Samuel; Lapp, Hilmar; Berger-Wolf, Tanya; Su, Yu; Chao, Wei-Lun; Gu, Jianyang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2510.20095

_version_	1866915825244438528
author	Zhang, Ziheng; Ma, Xinyue; Chowdhury, Arpita; Campolongo, Elizabeth G.; Thompson, Matthew J.; Zhang, Net; Stevens, Samuel; Lapp, Hilmar; Berger-Wolf, Tanya; Su, Yu; Chao, Wei-Lun; Gu, Jianyang
author_facet	Zhang, Ziheng; Ma, Xinyue; Chowdhury, Arpita; Campolongo, Elizabeth G.; Thompson, Matthew J.; Zhang, Net; Stevens, Samuel; Lapp, Hilmar; Berger-Wolf, Tanya; Su, Yu; Chao, Wei-Lun; Gu, Jianyang
contents	This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We complement this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BioCAP (i.e., BioCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_20095
institution	arXiv
publishDate	2025
record_format	arxiv
title	BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
topic	Computer Vision and Pattern Recognition; Computation and Language; Machine Learning
url	https://arxiv.org/abs/2510.20095
