

Bibliographic Details
Main authors:	Zielezinski, Andrzej, Gudyś, Adam, Deorowicz, Sebastian
Format:	Digital resource
Language:	English
Published:	Zenodo 2025
Subjects:	multiple sequence alignment protein sequences AlphaFold protein structures benchmark
Online access:	https://doi.org/10.5281/zenodo.16082639

Contents:

This dataset contains 1,166 protein families derived from AlphaFold Database Clusters. Family sizes range from approximately 1,000 to 680,000 sequences. For each family, the dataset provides:
<ul>
<li>protein sequences (FASTA format)</li>
<li>download URLs for the AlphaFold-predicted PDB structure corresponding to each protein sequence</li>
</ul>
These paired sequences and structures enable structure-based benchmarking of multiple sequence alignment (MSA) tools using the Local Distance Difference Test (LDDT) score, computed with the FoldMason tool.
<h3>Directory structure</h3>
The dataset contains two main directories:
<ul>
<li><code>fasta/</code> – protein sequences for each cluster [FASTA format]</li>
<li><code>pdb_urls/</code> – text files containing download URLs for AlphaFold PDB structures for each sequence in the cluster [TXT format]</li>
</ul>
A metadata file (<code>metadata.tsv</code>) is also included, providing detailed information for each cluster.
<h3>Metadata</h3>
For each cluster, <code>metadata.tsv</code> provides:
<ul>
<li>cluster_id – cluster identifier</li>
<li>seqs_count – total number of sequences in the cluster</li>
<li>min_seq_length – minimum sequence length within the cluster</li>
<li>mean_seq_length – average sequence length within the cluster</li>
<li>max_seq_length – maximum sequence length within the cluster</li>
</ul>
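
The layout described above could be read along the following lines. This is only a minimal sketch, not the authors' tooling: it assumes that <code>metadata.tsv</code> starts with a header row naming the fields listed above, and the function names (<code>read_metadata</code>, <code>clusters_in_size_range</code>, <code>read_fasta</code>) are hypothetical helpers introduced for illustration. The file naming inside <code>fasta/</code> and <code>pdb_urls/</code> is not stated in the description, so the sketch works on open file handles rather than guessing paths.

```python
import csv
from typing import Dict, Iterable, Iterator, List, Tuple


def read_metadata(handle: Iterable[str]) -> List[Dict[str, str]]:
    """Parse tab-separated metadata into one dict per cluster.

    Assumption: the first row is a header with the field names
    cluster_id, seqs_count, min_seq_length, mean_seq_length, max_seq_length.
    """
    return list(csv.DictReader(handle, delimiter="\t"))


def clusters_in_size_range(
    rows: List[Dict[str, str]], min_seqs: int, max_seqs: int
) -> List[str]:
    """Return the IDs of clusters whose seqs_count lies in [min_seqs, max_seqs]."""
    return [
        row["cluster_id"]
        for row in rows
        if min_seqs <= int(row["seqs_count"]) <= max_seqs
    ]


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    The header is returned without the leading '>'; sequence lines that
    wrap over several rows are joined into one string.
    """
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

With the archive unpacked, one might call <code>read_metadata(open("metadata.tsv", newline=""))</code>, pick the smaller clusters with <code>clusters_in_size_range(rows, 1000, 10000)</code>, and then stream the matching FASTA files through <code>read_fasta</code> before fetching structures from the URLs in <code>pdb_urls/</code>. Starting with small clusters keeps the number of structure downloads manageable, since the largest families hold hundreds of thousands of sequences.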
