Bibliographic Details
Main Authors: Sawhney, Rajan, Ferrell, Barbra D, Dejean, Thibaut, Schreiber, Zachary, Harrigan, William, Polson, Shawn W, Wommack, K Eric, Belcaid, Mahdi
Format: Scientific article
Language: en
Published: PeerJ 2025
Subjects:
Online Access: https://pubmed.ncbi.nlm.nih.gov/40985030/
Table of Contents:
  • Fine-tuning protein language models unlocks the potential of underrepresented viral proteomes.
    Authors: Sawhney, Rajan; Ferrell, Barbra D; Dejean, Thibaut; Schreiber, Zachary; Harrigan, William; Polson, Shawn W; Wommack, K Eric; Belcaid, Mahdi
    Subjects: Viral Proteins; Proteome; Computational Biology; Viruses
    Abstract: Protein language models (pLMs) have revolutionized computational biology by generating rich protein vector representations, or embeddings, enabling major advancements in protein design, structure prediction, variant effect analysis, and evolutionary studies. Despite these breakthroughs, current pLMs often exhibit biases against proteins from underrepresented species. Viral proteins are particularly affected: they are frequently referred to as the "dark matter" of the biological world because of their vast diversity and ubiquity, yet they are sparsely represented in training datasets. Here, we show that fine-tuning pre-trained pLMs on viral protein sequences, using diverse learning frameworks and parameter-efficient strategies, significantly enhances representation quality and improves performance on downstream tasks. To support further research, we provide source code for fine-tuning pLMs and benchmarking embedding quality. By enabling more accurate modeling of viral proteins, our approach advances tools for understanding viral biology, combating emerging infectious diseases, and driving biotechnological innovation.