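Staff View: VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science :: Library Catalog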


Bibliographic Details
Main Authors:	Abdalla, Youssef; Taub, Marrisa; Hilton, Eleanor; Akkaraju, Priya; Milanovic, Alexander; Orlu, Mine; Basit, Abdul W.; Cook, Michael T; Chakraborti, Tapabrata; Shorthouse, David
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2501.08995

_version_	1866929679489826816
author	Abdalla, Youssef; Taub, Marrisa; Hilton, Eleanor; Akkaraju, Priya; Milanovic, Alexander; Orlu, Mine; Basit, Abdul W.; Cook, Michael T; Chakraborti, Tapabrata; Shorthouse, David
author_facet	Abdalla, Youssef; Taub, Marrisa; Hilton, Eleanor; Akkaraju, Priya; Milanovic, Alexander; Orlu, Mine; Basit, Abdul W.; Cook, Michael T; Chakraborti, Tapabrata; Shorthouse, David
contents	Data scarcity in pharmaceutical research has led to reliance on labour-intensive trial-and-error approaches for development rather than data-driven methods. While Machine Learning offers a solution, existing datasets are often small and noisy, limiting their utility. To address this, we developed a Variationally Encoded Conditional Tabular Generative Adversarial Network (VECT-GAN), a novel generative model specifically designed for augmenting small, noisy datasets. We introduce a pipeline where data is augmented before regression model development and demonstrate that this consistently and significantly improves performance over other state-of-the-art tabular generative models. We apply this pipeline across six pharmaceutical datasets, and highlight its real-world applicability by developing novel polymers with medically desirable mucoadhesive properties, which we made and experimentally characterised. Additionally, we pre-train the model on the ChEMBL database of drug-like molecules, leveraging knowledge distillation to enhance its generalisability, making it readily available for use on pharmaceutical datasets containing small molecules, an extremely common pharmaceutical task. We demonstrate the power of synthetic data for regularising small tabular datasets, highlighting its potential to become standard practice in pharmaceutical model development, and make our method, including VECT-GAN pre-trained on ChEMBL, available as a pip package.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_08995
institution	arXiv
publishDate	2025
record_format	arxiv
title	VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science
topic	Machine Learning
url	https://arxiv.org/abs/2501.08995
