Staff View: TheBlueScrubs-v1, a comprehensive curated medical dataset derived from the internet :: Library Catalog

Bibliographic Details
Main Authors:	Felipe, Luis; Garcia, Carlos; Naqa, Issam El; Shotande, Monique; Tripathi, Aakash; Rudrapatna, Vivek; Rasool, Ghulam; Bitterman, Danielle; Valdes, Gilmer
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2504.02874

_version_	1866916673462730752
author	Felipe, Luis; Garcia, Carlos; Naqa, Issam El; Shotande, Monique; Tripathi, Aakash; Rudrapatna, Vivek; Rasool, Ghulam; Bitterman, Danielle; Valdes, Gilmer
author_facet	Felipe, Luis; Garcia, Carlos; Naqa, Issam El; Shotande, Monique; Tripathi, Aakash; Rudrapatna, Vivek; Rasool, Ghulam; Bitterman, Danielle; Valdes, Gilmer
contents	The need for robust and diverse data sets to train clinical large language models (cLLMs) is critical given that currently available public repositories often prove too limited in size or scope for comprehensive medical use. While resources like PubMed provide foundational medical literature, they capture only a narrow range of formal publications and omit the broader medical discourse on the internet. To address these deficits, we introduce TheBlueScrubs-v1, a curated dataset of over 25 billion medical tokens - nearly three times larger than PubMed - drawn from a broad-scale internet corpus. Our two-stage filtering pipeline employs a Logistic Regression model for document screening (achieving an AUC of approximately 0.95 on external validation), followed by verification via a 70B-parameter Llama 3.1 instruct model. Each text is assigned three LLM-based quality scores encompassing medical relevance, precision and factual detail, and safety and ethical standards. Clinician reviews confirm high concordance with these automated evaluations, and a specialized cancer classifier further labels approximately 11 billion oncology tokens. Two demonstration tasks highlight the dataset's practical value: first, we distill the safety evaluations to a smaller BERT-style model that reaches an AUC near 0.96 on unseen data; second, we fine-tune a compact LLM on a filtered subset, showing measurable improvements over standard baselines in medical benchmarks as well as private ones. This Data Descriptor details the dataset's creation and validation, underscoring its potential utility for medical AI research.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_02874
institution	arXiv
publishDate	2025
record_format	arxiv
title	TheBlueScrubs-v1, a comprehensive curated medical dataset derived from the internet
topic	Computation and Language
url	https://arxiv.org/abs/2504.02874