Staff View: :: Library Catalog

Bibliographic Details
Main authors:	Accordi, Gianmarco; Gadioli, Davide; Seguini, Giorgio; Beccari, Andrea R.; Palermo, Gianluca
Format:	Preprint
Published:	2024
Subjects:	Computational Engineering, Finance, and Science
Online access:	https://arxiv.org/abs/2404.19391

_version_	1866910428659974144
author	Accordi, Gianmarco; Gadioli, Davide; Seguini, Giorgio; Beccari, Andrea R.; Palermo, Gianluca
author_facet	Accordi, Gianmarco; Gadioli, Davide; Seguini, Giorgio; Beccari, Andrea R.; Palermo, Gianluca
contents	Virtual screening is a technique used in drug discovery to select the most promising molecules to test in a lab. Virtual screening needs a large set of molecules as input, and storing these molecules can become an issue. Extreme-scale high-throughput virtual screening applications require a large dataset of input molecules and produce an even larger dataset as output. These molecular databases occupy tens of TB of storage space, and domain experts frequently sample a small portion of the data. In this context, SMILES is a popular format for storing large sets of molecules, since it requires significantly less space to represent a molecule than other formats (e.g., MOL2, SDF). This paper proposes an efficient dictionary-based approach to compress SMILES-based datasets. The approach takes advantage of domain knowledge to provide a readable output with separable SMILES, enabling random access. We examine the benefits of storing these datasets using ZSMILES to reduce the cold storage footprint in HPC systems. The main contributions are a custom dictionary-based approach and a data pre-processing step. Experimental results show that ZSMILES leverages domain knowledge to compress 1.13x more than the state of the art in similar scenarios, reaching a compression ratio of up to 0.29. We tested a CUDA version of ZSMILES targeting NVIDIA GPUs, showing a potential speedup of 7x.
format	Preprint
id	arxiv_https___arxiv_org_abs_2404_19391
institution	arXiv
publishDate	2024
record_format	arxiv
title	ZSMILES: an approach for efficient SMILES storage for random access in Virtual Screening
topic	Computational Engineering, Finance, and Science
url	https://arxiv.org/abs/2404.19391
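
Illustrative note: the abstract describes dictionary-based compression that keeps each SMILES on its own line so that single molecules can be read back without decompressing the whole file. The Python sketch below shows that general idea only. The dictionary entries, the code symbols (`!`, `?`, `_`, `;`), and all function names are hypothetical choices made for this example; they are not the ZSMILES dictionary, pre-processing step, or API, and the sketch assumes the code symbols never occur in the input SMILES.

```python
# Toy dictionary (hypothetical, NOT the ZSMILES dictionary): frequent SMILES
# substrings mapped to single printable symbols assumed absent from the input.
TOY_DICT = {
    "c1ccccc1": "!",  # benzene ring
    "C(=O)": "?",     # carbonyl carbon
    "(=O)": ";",      # carbonyl branch
    "CC": "_",        # two-carbon chain
}
DECODE = {code: pattern for pattern, code in TOY_DICT.items()}
# Try longer patterns first so the greedy match prefers the biggest saving.
PATTERNS = sorted(TOY_DICT, key=len, reverse=True)


def compress_line(smiles: str) -> str:
    """Greedy longest-match substitution on a single SMILES string."""
    if any(code in smiles for code in DECODE):
        raise ValueError("input contains a reserved code symbol")
    out = []
    i = 0
    while i < len(smiles):
        for pattern in PATTERNS:
            if smiles.startswith(pattern, i):
                out.append(TOY_DICT[pattern])
                i += len(pattern)
                break
        else:
            # No dictionary entry starts here: copy the character unchanged.
            out.append(smiles[i])
            i += 1
    return "".join(out)


def decompress_line(encoded: str) -> str:
    """Expand every code symbol; other characters pass through unchanged."""
    return "".join(DECODE.get(ch, ch) for ch in encoded)


def compression_ratio(originals, encoded) -> float:
    """Compressed size divided by original size (lower is better)."""
    return sum(len(z) for z in encoded) / sum(len(s) for s in originals)


molecules = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1", "CCO"]
# One compressed line per molecule, so lines stay separable.
compressed = [compress_line(s) for s in molecules]

# Random access: decode only the entry of interest.
first = decompress_line(compressed[0])
```

Because each line is encoded independently, any single entry can be decoded without touching its neighbours, which is the property the abstract refers to as random access. The ratio computed here is compressed size over original size, the same sense in which the abstract's 0.29 figure reads; the toy numbers say nothing about real ZSMILES performance.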
