Staff View: Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS$^2$-based Proteomics :: Library Catalog

Bibliographic Details
Main Authors:	Xu, Hao; Wang, Zhichao; Sang, Shengqi; Wajanasara, Pisit; Bandeira, Nuno
Format:	Preprint
Published:	2025
Subjects:	Biomolecules; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2508.21076

_version_	1866909758380834816
author	Xu, Hao; Wang, Zhichao; Sang, Shengqi; Wajanasara, Pisit; Bandeira, Nuno
author_facet	Xu, Hao; Wang, Zhichao; Sang, Shengqi; Wajanasara, Pisit; Bandeira, Nuno
contents	Proteins perform nearly all cellular functions and constitute most drug targets, making their analysis fundamental to understanding human biology in health and disease. Tandem mass spectrometry (MS$^2$) is the major analytical technique in proteomics: it ionizes and fragments peptides, then uses the resulting mass spectra to identify and quantify proteins in biological samples. In MS$^2$ analysis, peptide fragment ion probability prediction plays a critical role, enhancing the accuracy of peptide identification from mass spectra as a complement to the intensity information. Current approaches rely on global fragmentation statistics, which assume that a fragment's probability is uniform across all peptides. This assumption oversimplifies the underlying biochemistry and limits prediction accuracy. To address this gap, we present Pep2Prob, the first comprehensive dataset and benchmark designed for peptide-specific fragment ion probability prediction. The proposed dataset contains fragment ion probability statistics for 608,780 unique precursors (each precursor is a pair of peptide sequence and charge state), summarized from more than 183 million high-quality, high-resolution HCD MS$^2$ spectra with validated peptide assignments and fragmentation annotations. We establish baseline performance using simple statistical rules and learning-based methods, and find that models leveraging peptide-specific information significantly outperform previous methods using only global fragmentation statistics. Furthermore, performance across benchmark models with increasing capacities suggests that the peptide-fragmentation relationship exhibits complex nonlinearities requiring sophisticated machine learning approaches.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_21076
institution	arXiv
publishDate	2025
record_format	arxiv
title	Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS$^2$-based Proteomics
topic	Biomolecules; Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2508.21076

Similar Items