MassNet: A Foundational Resource for Advancing AI in Proteomics
MassNet is the largest publicly available proteomics dataset based on data-dependent acquisition (DDA) to date, and the first specifically optimized for AI applications. It contains:
~30 TB of raw DDA-MS data;
1.54 billion MS/MS spectra;
558 million peptide-spectrum matches (PSMs);
Coverage across 35 species, including animals, plants, and microorganisms;
A human subset that covers ~98% of annotated proteins.
The release of MassNet marks a new chapter in AI-driven proteomics:
The first foundational training dataset in proteomics, comparable in role to the foundational datasets of natural language processing (NLP) and computer vision (CV);
Enables AI-driven applications in non-model organism research, biomarker discovery, and post-translational modification (PTM) identification;
Built on a standardized format and high-performance architecture to support scalable and reproducible proteomics analysis.
MassNet
MSDT (Mass Spectrometry Data Tensor)
MassNet structures raw spectra into 2D tensors, preserving key features like m/z and intensity through a unified data format.
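The idea of packing a spectrum into a 2D tensor can be illustrated with a minimal sketch. This is not the actual MSDT layout, which is not described here. The function name `spectrum_to_tensor`, the `max_peaks` parameter, the base-peak normalization, and the zero padding are all assumptions made for illustration. Each row holds one peak as an (m/z, relative intensity) pair, and every spectrum is padded or truncated to the same number of rows so that spectra can be stacked into a batch.

```python
from typing import List, Sequence


def spectrum_to_tensor(
    mz: Sequence[float],
    intensity: Sequence[float],
    max_peaks: int = 4,
) -> List[List[float]]:
    """Pack one spectrum into a fixed-size [max_peaks x 2] nested list.

    Hypothetical layout (NOT the real MSDT schema):
    column 0 = m/z, column 1 = intensity relative to the base peak.
    """
    if len(mz) != len(intensity):
        raise ValueError("mz and intensity must have the same length")

    # Keep only the most intense peaks if the spectrum is too long.
    peaks = sorted(zip(mz, intensity), key=lambda p: p[1], reverse=True)
    peaks = peaks[:max_peaks]

    # Restore ascending m/z order, the usual ordering for peak lists.
    peaks.sort(key=lambda p: p[0])

    # Normalize intensities to the base (most intense) peak.
    base = max((i for _, i in peaks), default=0.0)
    rows = [[m, (i / base if base > 0 else 0.0)] for m, i in peaks]

    # Zero-pad so every spectrum has the same number of rows.
    rows += [[0.0, 0.0] for _ in range(max_peaks - len(rows))]
    return rows


tensor = spectrum_to_tensor([100.0, 200.0, 150.0], [10.0, 50.0, 25.0])
for row in tensor:
    print(row)
```

In practice the nested list would likely be converted to an array type from a numerical library before training; plain lists are used here only to keep the sketch dependency-free.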
XuanjiNovo
XuanjiNovo is a MassNet-based decoding model integrating multiple core innovations.