1. Overview
The Mass Spectrometry DDA Tensor (MSDT) format is a standardized, schema-rich data representation designed to support large-scale, AI-driven proteomics analysis. MSDT provides a compact and structured abstraction of tandem mass spectrometry (MS/MS) data, enabling efficient storage, exchange, and reuse across diverse machine learning tasks, including de novo peptide sequencing, PSM rescoring, and retention time prediction.
MSDT is designed as a logical data representation derived deterministically from raw mass spectrometry data and associated identification results. It is not intended to replace existing raw data standards such as mzML, but to complement them by offering an AI-friendly representation optimized for downstream modeling and benchmarking. MSDT files can be generated using the open-source MSDT-Converter tool (https://github.com/guomics-lab/MSDT-Converter).
2. Scope and Non-goals
2.1 Purpose
MSDT is designed to:
- Represent MS/MS spectra in a structured, tabular form suitable for large-scale machine learning.
- Serve as a reusable and interoperable data layer between raw data formats and AI models.
- Support dataset construction, benchmarking, and cross-project reuse in proteomics.
2.2 Explicit non-goals
MSDT is not intended to:
- Replace mzML or vendor-specific raw formats as archival standards.
- Preserve all instrument-specific metadata or acquisition parameters.
- Serve as a direct execution-time data format for model training or inference under highly demanding data access patterns.
Execution-layer formats (e.g., LMDB) may be derived from MSDT for computational efficiency but are outside the scope of the MSDT specification itself.
3. Conceptual Data Model
At the conceptual level, MSDT represents MS/MS data as a collection of independent spectra, where:
- Each row corresponds to a single MS/MS spectrum.
- Each spectrum is associated with:
* Precursor information (m/z, charge, retention time).
* Fragment ion peak lists.
* Optional identification-related annotations.
All entries in MSDT are derived deterministically from raw spectra and, where applicable, database search results.
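As an illustration of this row-level model, the sketch below expresses one spectrum as a Python dataclass. The field names follow the schema in Section 6; the Python types, the `Spectrum` class name, and the equal-length check on the two peak arrays are assumptions made for this example and are not stated by the specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Spectrum:
    """One MSDT row: a single MS/MS spectrum (types are assumed, not normative)."""

    # Precursor information
    scan_id: str
    precursor_mz: float
    precursor_charge: int
    retention_time: float
    # Fragment ion peak list, stored as two parallel arrays
    mz_array: List[float]
    intensity_array: List[float]
    # Optional identification-related annotations
    peptide_sequence: Optional[str] = None
    q_value: Optional[float] = None

    def __post_init__(self) -> None:
        # Assumption: each m/z value pairs with exactly one intensity value.
        if len(self.mz_array) != len(self.intensity_array):
            raise ValueError("mz_array and intensity_array must have equal length")
```

Because spectra are independent, a dataset under this sketch is simply a sequence of such objects, and an unidentified spectrum leaves the optional annotations as `None`.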
4. Physical Storage Format
The physical storage format provides:
- Explicit schema definition and validation.
- Efficient compression and encoding.
- Fast column-wise access and slicing.
5. Versioning and Compatibility
- The current specification defines MSDT v1.0.
- All files conforming to MSDT v1.x share a stable core schema.
- Minor version updates (v1.x → v1.y) may introduce optional fields but MUST NOT break backward compatibility.
- Major version updates (e.g., v2.0) may introduce incompatible schema changes and will be explicitly versioned.
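The compatibility rules above can be read as "same major version means the core schema is readable". A minimal sketch of that check follows. The specification does not say how or where a file records its version, so the `v<major>.<minor>` tag string and the helper names `parse_version` and `can_read` are hypothetical.

```python
from typing import Tuple


def parse_version(tag: str) -> Tuple[int, int]:
    """Parse a tag such as 'v1.0' or '1.2' into (major, minor)."""
    parts = tag.strip().lstrip("vV").split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def can_read(reader_version: str, file_version: str) -> bool:
    """Return True when reader and file share a major version.

    Within one major version, minor updates may only add optional fields,
    so the core schema stays readable in both directions. Across major
    versions the schema may change incompatibly, so no guarantee is made.
    """
    return parse_version(reader_version)[0] == parse_version(file_version)[0]
```

Under this reading, a v1.0 reader can open a v1.3 file and ignore any optional fields it does not recognize, while a v2.0 file would be rejected.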
6. MSDT v1.0 Schema
The MSDT v1.0 schema consists of a set of required and optional fields describing MS/MS spectra and associated metadata. For detailed field annotations and documentation of the MSDT format, see: https://www.guomics.com/software/massnet/exp.html
6.1 Required fields
- scan_id
- precursor_mz
- precursor_charge
- retention_time
- mz_array
- intensity_array
6.2 Optional fields
- peptide_sequence
- modifications
- protein_accessions
- psm_score
- q_value
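The field lists in 6.1 and 6.2 translate directly into a column check. The sketch below is one possible way to test a set of column names against them. The function name `check_columns` is hypothetical, and the check covers names only, since this section does not specify column data types.

```python
from typing import Iterable, List, Tuple

# Field names copied from Sections 6.1 and 6.2.
REQUIRED_FIELDS = (
    "scan_id",
    "precursor_mz",
    "precursor_charge",
    "retention_time",
    "mz_array",
    "intensity_array",
)
OPTIONAL_FIELDS = (
    "peptide_sequence",
    "modifications",
    "protein_accessions",
    "psm_score",
    "q_value",
)


def check_columns(columns: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (missing required fields, fields not named in the v1.0 schema)."""
    present = set(columns)
    missing = [name for name in REQUIRED_FIELDS if name not in present]
    unknown = sorted(present - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS))
    return missing, unknown
```

A table passes this check when the first list is empty. Unknown names are reported rather than rejected, since later v1.x releases may add optional fields.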
7. Normalization and Conventions
- m/z values are reported in Thomson (Th).
- Intensity values are stored as floating-point arrays and may reflect normalized or raw intensities, depending on dataset construction.
8. Minimal Working Example
A minimal MSDT table consists of:
- One row per MS/MS spectrum.
- Required precursor metadata.
- Fragment ion peak lists stored as array-typed columns.
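To make the three properties above concrete, here is a sketch of a two-spectrum table held as a column-oriented Python dictionary. Every value is a made-up placeholder rather than a real measurement, the retention time unit is left unspecified because the specification does not define it, and the `row` helper is hypothetical.

```python
# Column-oriented table: one list per required field, one position per spectrum.
# All numbers below are illustrative placeholders, not real data.
table = {
    "scan_id": ["scan_1", "scan_2"],
    "precursor_mz": [445.12, 622.34],
    "precursor_charge": [2, 3],
    "retention_time": [10.5, 11.2],
    # Array-typed columns: each cell holds a whole fragment peak list.
    "mz_array": [[101.07, 175.12, 288.20], [129.10, 246.18]],
    "intensity_array": [[1200.0, 530.0, 80.0], [900.0, 410.0]],
}


def row(tbl: dict, index: int) -> dict:
    """Assemble the spectrum at position `index` from the columns."""
    return {name: column[index] for name, column in tbl.items()}


# Every column must describe the same number of spectra.
n_rows = len(table["scan_id"])
assert all(len(column) == n_rows for column in table.values())

# Column-wise access reads one field for all spectra without touching peak lists.
charges = table["precursor_charge"]
```

Here `row(table, 1)` rebuilds the second spectrum as a single record, while `charges` shows the column-wise access pattern that the physical format is meant to make fast.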
9. Relationship to Other Formats
- mzML: MSDT is derived from mzML but does not aim to preserve full raw data fidelity.
- MGF: MSDT generalizes spectrum-level representations beyond flat text formats.
- Execution formats (e.g., LMDB): These may be derived from MSDT for performance reasons but are not part of the MSDT specification.
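As a rough illustration of the MGF relationship, the sketch below parses one MGF `BEGIN IONS` / `END IONS` block into a dictionary keyed by the MSDT required field names. The mapping is an assumption made for this example (in particular, using `TITLE` as `scan_id` and keeping `RTINSECONDS` unchanged as `retention_time`). It is not the behavior of MSDT-Converter, and the function name `mgf_block_to_row` is hypothetical.

```python
def mgf_block_to_row(block: str) -> dict:
    """Convert one MGF spectrum block into an MSDT-style row (assumed mapping)."""
    row = {"mz_array": [], "intensity_array": []}
    for raw_line in block.splitlines():
        line = raw_line.strip()
        if not line or line in ("BEGIN IONS", "END IONS"):
            continue
        if "=" in line and not line[0].isdigit():
            # Header line of the form KEY=value.
            key, value = line.split("=", 1)
            if key == "TITLE":
                row["scan_id"] = value  # assumption: title doubles as scan id
            elif key == "PEPMASS":
                # PEPMASS may carry an intensity after the m/z; keep the m/z only.
                row["precursor_mz"] = float(value.split()[0])
            elif key == "CHARGE":
                # Handles single charge states such as '2+' or '1-'.
                sign = -1 if value.strip().endswith("-") else 1
                row["precursor_charge"] = sign * int(value.strip().rstrip("+-"))
            elif key == "RTINSECONDS":
                row["retention_time"] = float(value)
        else:
            # Peak line: m/z and intensity, optionally followed by more columns.
            mz, intensity = line.split()[:2]
            row["mz_array"].append(float(mz))
            row["intensity_array"].append(float(intensity))
    return row
```

The flat text block becomes typed scalar fields plus two array-typed fields, which is the shape described in Section 3. Optional identification fields have no counterpart in a plain MGF block and would come from search results instead.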