Bibliographic details
Main authors: Zielezinski, Andrzej; Gudyś, Adam; Deorowicz, Sebastian
Format: Digital resource
Language: English
Published: Zenodo 2025
Online access: https://doi.org/10.5281/zenodo.16082639
Description:
  • <p>This dataset contains 1,166 protein families derived from AlphaFold Database Clusters. Family sizes range from approximately 1,000 to 680,000 sequences.</p>
<p>For each family, the dataset provides:</p>
<ul> <li>protein sequences (FASTA format)</li> <li>download URLs for the AlphaFold-predicted PDB structure of each protein sequence</li> </ul>
<p>These paired sequences and structures enable structure-based benchmarking of multiple sequence alignment (MSA) tools using the Local Distance Difference Test (LDDT) score, computed with the FoldMason tool.</p>
<h3><strong>Directory structure</strong></h3>
<p>The dataset contains two main directories:</p>
<ul> <li><code>fasta/</code> – protein sequences for each cluster [FASTA format]</li> <li><code>pdb_urls/</code> – text files containing download URLs for the AlphaFold PDB structure of each sequence in the cluster [TXT format]</li> </ul>
<p>A metadata file (<code>metadata.tsv</code>) is also included.</p>
<h3><strong>Metadata</strong></h3>
<p>For each cluster, <code>metadata.tsv</code> provides the following fields:</p>
<ul> <li><strong>cluster_id</strong> – cluster identifier</li> <li><strong>seqs_count</strong> – total number of sequences in the cluster</li> <li><strong>min_seq_length</strong> – minimum sequence length within the cluster</li> <li><strong>mean_seq_length</strong> – mean sequence length within the cluster</li> <li><strong>max_seq_length</strong> – maximum sequence length within the cluster</li> </ul>
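The metadata fields listed above could be used to pick clusters of a manageable size before downloading any structures. The following is a minimal sketch, not part of the dataset itself. It assumes that `metadata.tsv` is tab-separated with a header row whose column names match the field names above, that `seqs_count` and the min/max lengths are integers, and that `mean_seq_length` is a decimal number. The function names and the sample rows (including the cluster identifiers) are made up for illustration.

```python
import csv
import io


def read_metadata(handle):
    """Parse metadata rows into dicts, converting numeric fields.

    Assumes a tab-separated file with a header row naming the five
    fields described in the record (an assumption, not verified here).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "cluster_id": row["cluster_id"],
            "seqs_count": int(row["seqs_count"]),
            "min_seq_length": int(row["min_seq_length"]),
            "mean_seq_length": float(row["mean_seq_length"]),
            "max_seq_length": int(row["max_seq_length"]),
        })
    return rows


def select_by_size(rows, min_seqs, max_seqs):
    """Return the IDs of clusters whose sequence count lies in [min_seqs, max_seqs]."""
    return [r["cluster_id"] for r in rows if min_seqs <= r["seqs_count"] <= max_seqs]


# Hypothetical sample standing in for the real metadata.tsv.
SAMPLE = (
    "cluster_id\tseqs_count\tmin_seq_length\tmean_seq_length\tmax_seq_length\n"
    "clusterA\t1200\t150\t210.5\t300\n"
    "clusterB\t45000\t80\t120.0\t410\n"
    "clusterC\t680000\t60\t95.2\t520\n"
)

rows = read_metadata(io.StringIO(SAMPLE))
small = select_by_size(rows, 1000, 50000)
print(small)
```

To run this against the real file, replace the in-memory sample with `open("metadata.tsv", newline="")`, adjusting the path to wherever the archive was unpacked.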