Run-Level FDR-Filtered MSDT
Each file is provided separately and reflects only the run-level FDR control applied by the individual search engine. No global FDR calibration is performed. This design is intended to give transparent access to native search engine outputs rather than globally calibrated confidence estimates.
FragPipe-derived MSDT
scan (int64) Scan number of the MS/MS spectrum used for peptide identification
precursor_mz (float64) Observed m/z value of the peptide precursor ion
precursor_intensity (float64) Observed intensity of the peptide precursor ion
rt (float64) Retention time of the peptide precursor in the LC-MS run
mz_array (float32 ndarray) List of fragment ion m/z values in the MS/MS spectrum
intensity_array (float32 ndarray) List of corresponding ion intensities for the fragment ions
label (int64) Target/decoy label assigned to the PSM (1 for target, –1 for decoy)
charge (int64) Peptide precursor charge state(s)
ExpMass (float64) Experimental precursor mass (observed peptide mass)
rank (int64) Rank of the PSM among all candidate matches for a given spectrum (1 = best)
isotope_errors (int64) Isotopic mass offsets allowed for precursor ion matching (e.g., 0, 1, 2)
hyperscore (float64) Similarity score between observed and theoretical spectra; higher values indicate greater similarity
delta_hyperscore (float64) Difference in hyperscore between the top and second-best matches for a given spectrum
matched_ion_num (int64) Number of fragment ions that successfully matched theoretical ions
ion_series (int64) Type(s) of fragment ions matched (e.g., b, y, or both)
unweighted_spectral_entropy (float64) Entropy-based score describing the distribution of fragment ion intensities within a spectrum; lower values indicate more concentrated (higher-quality) spectra, while higher values suggest more uniform or noisy ion distributions
delta_RT_loess (float64) Retention time deviation between observed and predicted RT after LOESS correction
precursor_sequence (str) Peptide amino acid sequence with modifications
proteins (str) List of protein accessions containing this peptide sequence
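To make the unweighted_spectral_entropy description concrete, the sketch below computes a plain Shannon entropy over normalized fragment intensities. This is an assumption for illustration only: the exact definition used to produce the stored values (logarithm base, peak filtering, any preprocessing) is not stated here and may differ.

```python
import math

def unweighted_spectral_entropy(intensities):
    """Shannon entropy (natural log) of normalized fragment intensities.

    Illustrative assumption: the stored column may be computed with a
    different log base or after additional peak preprocessing.
    """
    # Ignore zero or negative intensities; they carry no probability mass.
    positive = [x for x in intensities if x > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    return -sum((x / total) * math.log(x / total) for x in positive)

# One dominant peak gives low entropy; uniform peaks give high entropy.
concentrated = unweighted_spectral_entropy([1000.0, 1.0, 1.0, 1.0])
uniform = unweighted_spectral_entropy([10.0, 10.0, 10.0, 10.0])
```

Under this definition a spectrum with a single peak has entropy 0, and a spectrum with n equal peaks has entropy ln(n), which matches the direction described for the field.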
The 100M training set and the 15-species benchmark dataset were constructed from FragPipe-derived MSDT. Only fields required for model training were retained, including precursor_mz, precursor_charge (corresponding to the charge field in the FragPipe-derived MSDT), mz_array, intensity_array, and pep (corresponding to the precursor_sequence field in the FragPipe-derived MSDT).
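The field mapping described above can be sketched as a simple projection from a FragPipe-derived record to a training record. The function name `to_training_record` and the use of plain dictionaries are hypothetical; only the five field names and the two renames come from the description, and the actual training files may retain additional fields.

```python
# Source field (FragPipe-derived MSDT) -> field name in the training set.
TRAINING_FIELD_MAP = {
    "precursor_mz": "precursor_mz",
    "charge": "precursor_charge",
    "mz_array": "mz_array",
    "intensity_array": "intensity_array",
    "precursor_sequence": "pep",
}

def to_training_record(psm):
    """Keep only the listed fields and apply the documented renames.

    Hypothetical helper for illustration; `psm` is any mapping keyed by
    FragPipe-derived MSDT field names.
    """
    return {new: psm[old] for old, new in TRAINING_FIELD_MAP.items()}

# Made-up example values, not taken from the dataset.
example_psm = {
    "scan": 42,
    "precursor_mz": 500.25,
    "charge": 2,
    "mz_array": [100.1, 200.2],
    "intensity_array": [10.0, 20.0],
    "precursor_sequence": "PEPTIDEK",
    "hyperscore": 31.7,
}
training_record = to_training_record(example_psm)
```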
Sage-derived MSDT
During database searching with Sage, up to ten candidate peptide–spectrum matches (PSMs) are retained for each spectrum. The Sage-derived MSDT files preserve all retained candidates so that the search results are fully represented and can be re-scored or modeled downstream. To avoid storing the same spectrum repeatedly, each spectrum is recorded only once, and all candidate peptide matches for that spectrum are consolidated into a single record rather than duplicated across multiple rows. This structure keeps the data complete while remaining compact and consistent.
scan (int64) Spectrum scan number used for PSM scoring
precursor_sequence (ndarray) Identified peptide sequence (without modifications)
proteins (ndarray) Protein accessions or headers corresponding to the peptide
label (int8 ndarray) Binary label (1 = target, –1 = decoy) used for model training or FDR estimation
charge (int8 ndarray) Precursor charge state
matched_peaks (int32 ndarray) Number of fragment ions matched between experimental and theoretical spectra
peptide_q (float32 ndarray) q-value for peptide-level FDR estimation
protein_q (float32 ndarray) q-value for protein-level FDR estimation
predicted_rt (float32 ndarray) Predicted retention time
ion_mobility (float32 ndarray) Ion mobility value (if available from the instrument)
delta_rt (float32 ndarray) Retention time deviation between predicted and observed RT
spectrum_q (float32 ndarray) q-value at the PSM (spectrum) level
sage_discriminant_score (float32 ndarray) Discriminant score (e.g., linear combination of features) used by Sage for classification
precursor_mz (float64) Observed precursor ion m/z
rt (float64) Retention time of precursor in LC-MS run
mz_array (float32 ndarray) List of fragment ion m/z values
intensity_array (float32 ndarray) List of fragment ion intensities corresponding to mz_array
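Because each Sage-derived record consolidates all candidates for one spectrum, downstream tools that expect one row per PSM need to expand it. The sketch below shows one way to do that, assuming the candidate-level fields are index-aligned arrays of equal length (entry i in each array describes candidate i). That alignment is an assumption inferred from the field types, and `explode_record` is a hypothetical helper, not part of any released tooling.

```python
# Fields assumed to hold one entry per candidate PSM.
CANDIDATE_FIELDS = (
    "precursor_sequence", "proteins", "label", "charge", "matched_peaks",
    "peptide_q", "protein_q", "predicted_rt", "ion_mobility", "delta_rt",
    "spectrum_q", "sage_discriminant_score",
)

def explode_record(record):
    """Expand one consolidated spectrum record into one dict per candidate.

    Spectrum-level fields (scan, precursor_mz, rt, mz_array,
    intensity_array) are shared by reference, not copied.
    """
    n_candidates = len(record["precursor_sequence"])
    shared = {k: v for k, v in record.items() if k not in CANDIDATE_FIELDS}
    rows = []
    for i in range(n_candidates):
        row = dict(shared)
        for name in CANDIDATE_FIELDS:
            if name in record:
                row[name] = record[name][i]
        rows.append(row)
    return rows

# Made-up record with two candidates; lists stand in for ndarrays here.
example_record = {
    "scan": 1001,
    "precursor_mz": 612.83,
    "rt": 35.2,
    "mz_array": [175.1, 288.2, 401.3],
    "intensity_array": [50.0, 120.0, 80.0],
    "precursor_sequence": ["PEPTIDEK", "EDITPEPK"],
    "label": [1, -1],
    "charge": [2, 2],
    "spectrum_q": [0.001, 0.4],
}
psm_rows = explode_record(example_record)
```

Sharing the spectrum arrays by reference keeps the expanded view cheap in memory, which preserves the storage benefit of the consolidated layout.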
MGF-derived MSDT
mz (float32) m/z values of fragment ions in the MS/MS spectrum
intensity (float32) Corresponding ion intensities for each m/z value
TITLE (str) Unique identifier or description for the spectrum (often includes file name, scan number, or charge)
PEPMASS (float64) Precursor ion m/z and (optionally) intensity
CHARGE (str) Precursor charge state (e.g., 2+, 3+)
INSTRUMENT (str) Type or model of the mass spectrometer used to acquire the data
RTINSECONDS (float64) Retention time of the precursor ion in seconds
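As a rough illustration of how the fields above relate to MGF text, the following sketch parses `BEGIN IONS` / `END IONS` blocks into dictionaries keyed by the same field names. It is a minimal standard-library parser written for this example, not the tool used to build the files; it assumes PEPMASS holds the m/z first and keeps only that value, and it leaves CHARGE as a string.

```python
def parse_mgf(lines):
    """Parse MGF-formatted lines into a list of per-spectrum dicts."""
    spectra = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            current = {"mz": [], "intensity": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is None:
            continue  # skip anything outside a spectrum block
        elif "=" in line and not line[0].isdigit():
            # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
            key, value = line.split("=", 1)
            key = key.upper()
            if key == "PEPMASS":
                current[key] = float(value.split()[0])  # m/z only
            elif key == "RTINSECONDS":
                current[key] = float(value)
            else:
                current[key] = value
        else:
            # Peak line: "m/z intensity"; a missing intensity becomes 0.0
            parts = line.split()
            current["mz"].append(float(parts[0]))
            current["intensity"].append(float(parts[1]) if len(parts) > 1 else 0.0)
    return spectra

# Made-up spectrum for demonstration.
example_mgf = """BEGIN IONS
TITLE=run1.scan=42
PEPMASS=500.25 12345.0
CHARGE=2+
RTINSECONDS=61.5
100.1 10.0
200.2 20.0
END IONS
""".splitlines()
spectra = parse_mgf(example_mgf)
```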