MassNET

1. Overview

The Mass Spectrometry DDA Tensor (MSDT) format is a standardized, schema-rich data representation designed to support large-scale, AI-driven proteomics analysis. MSDT provides a compact and structured abstraction of tandem mass spectrometry (MS/MS) data, enabling efficient storage, exchange, and reuse across diverse machine learning tasks, including de novo peptide sequencing, PSM rescoring, and retention time prediction.

MSDT is designed as a logical data representation derived deterministically from raw mass spectrometry data and associated identification results. It is not intended to replace existing raw data standards such as mzML, but to complement them by offering an AI-friendly representation optimized for downstream modeling and benchmarking. MSDT files can be generated using the open-source MSDT-Converter tool (https://github.com/guomics-lab/MSDT-Converter).

2. Scope and Non-goals

2.1 Purpose

MSDT is designed to:

Represent MS/MS spectra in a structured, tabular form suitable for large-scale machine learning.
Serve as a reusable and interoperable data layer between raw data formats and AI models.
Support dataset construction, benchmarking, and cross-project reuse in proteomics.

2.2 Explicit non-goals

MSDT is not intended to:

Replace mzML or vendor-specific raw formats as archival standards.
Preserve all instrument-specific metadata or acquisition parameters.
Serve as a direct execution-time data format for model training or inference under highly demanding data access patterns.

Execution-layer formats (e.g., LMDB) may be derived from MSDT for computational efficiency but are outside the scope of the MSDT specification itself.

3. Conceptual Data Model

At the conceptual level, MSDT represents MS/MS data as a collection of independent spectra, where:

Each row corresponds to a single MS/MS spectrum.
Each spectrum is associated with:

    * Precursor information (m/z, charge, retention time).
    * Fragment ion peak lists.
    * Optional identification-related annotations.

All entries in MSDT are derived deterministically from raw spectra and, where applicable, database search results.
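
The row-per-spectrum model above can be sketched as a plain Python record. This is only an illustration: the field names are borrowed from the schema in Section 6, while the `Spectrum` class, the Python types, and the equal-length check on the two peak arrays are assumptions that the specification does not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Spectrum:
    """One conceptual MSDT row: a single MS/MS spectrum (hypothetical class)."""

    scan_id: str
    precursor_mz: float
    precursor_charge: int
    retention_time: float
    mz_array: List[float]
    intensity_array: List[float]
    # Identification-related annotations are optional and may be missing.
    peptide_sequence: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: each fragment m/z value is paired with one intensity.
        if len(self.mz_array) != len(self.intensity_array):
            raise ValueError("mz_array and intensity_array differ in length")


# A dataset is a collection of independent spectra (values are made up).
spectra = [
    Spectrum("scan-1", 500.25, 2, 12.5, [100.0, 200.0], [10.0, 20.0]),
    Spectrum("scan-2", 612.80, 3, 30.1, [150.5], [5.0], peptide_sequence="PEPTIDE"),
]
```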

4. Physical Storage Format

MSDT is physically stored using the Apache Parquet format, a columnar data storage standard that provides:

Explicit schema definition and validation.
Efficient compression and encoding.
Fast column-wise access and slicing.

Parquet was selected to enable scalable dataset construction and efficient downstream processing while maintaining strong compatibility with common data processing ecosystems.

5. Versioning and Compatibility

Each MSDT file MUST declare a format version identifier.

The current specification defines MSDT v1.0.
All files conforming to MSDT v1.x share a stable core schema.
Minor version updates (v1.x → v1.y) may introduce optional fields but MUST NOT break backward compatibility.
Major version updates (e.g., v2.0) may introduce incompatible schema changes and will be explicitly versioned.

Versioning ensures that MSDT datasets remain interpretable and reusable as the format evolves.
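
As a rough sketch of how a reader might apply these rules, the helper below treats two versions as compatible when they share a major version. The textual form of the identifier ("v1.0" or "1.0") and the function names are assumptions; the specification does not say how or where the version is encoded.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Parse an identifier such as 'v1.0' or '1.3' into (major, minor)."""
    parts = text.strip().lstrip("vV").split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def can_read(file_version: str, reader_version: str) -> bool:
    """Same major version: the stable core schema is shared."""
    return parse_version(file_version)[0] == parse_version(reader_version)[0]


def may_contain_unknown_optional_fields(file_version: str, reader_version: str) -> bool:
    """A newer minor version may add optional fields the reader does not know."""
    file_major, file_minor = parse_version(file_version)
    reader_major, reader_minor = parse_version(reader_version)
    return file_major == reader_major and file_minor > reader_minor
```

Under this reading, a reader written for v1.0 would still accept a hypothetical v1.2 file and simply ignore any optional columns it does not recognize.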

6. MSDT v1.0 Schema

The MSDT v1.0 schema consists of a set of required and optional fields describing MS/MS spectra and associated metadata. Further details on the MSDT file format are described below.

Run-Level FDR-Filtered MSDT

Globally FDR-Filtered MSDT

6.1 Required fields

scan_id
precursor_mz
precursor_charge
retention_time
mz_array
intensity_array

6.2 Optional fields

peptide_sequence
modifications
protein_accessions
psm_score
q_value

Optional fields may be absent depending on the dataset construction workflow and intended use.
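
A simple column check against the field lists above might look like the following. The field names are taken from Sections 6.1 and 6.2; the `check_columns` helper and its report layout are hypothetical and are not part of the specification or of MSDT-Converter.

```python
from typing import Dict, Iterable, List

REQUIRED_FIELDS = (
    "scan_id",
    "precursor_mz",
    "precursor_charge",
    "retention_time",
    "mz_array",
    "intensity_array",
)

OPTIONAL_FIELDS = (
    "peptide_sequence",
    "modifications",
    "protein_accessions",
    "psm_score",
    "q_value",
)


def check_columns(columns: Iterable[str]) -> Dict[str, List[str]]:
    """Report which v1.0 fields are missing, present, or unrecognized."""
    present = set(columns)
    return {
        "missing_required": [f for f in REQUIRED_FIELDS if f not in present],
        "optional_present": [f for f in OPTIONAL_FIELDS if f in present],
        "unknown": sorted(present - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS)),
    }


def is_valid(columns: Iterable[str]) -> bool:
    """A table passes this check when no required field is missing."""
    return not check_columns(columns)["missing_required"]
```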

7. Normalization and Conventions

m/z values are reported in thomsons (Th).
Intensity values are stored as floating-point arrays and may reflect normalized or raw intensities, depending on dataset construction.

Missing or unavailable values are explicitly encoded and do not imply absence of the corresponding entity. All conventions are applied consistently within a given MSDT dataset.

8. Minimal Working Example

A minimal MSDT table consists of:

One row per MS/MS spectrum.
Required precursor metadata.
Fragment ion peak lists stored as array-typed columns.

This minimal representation is sufficient to support a wide range of AI-based proteomics tasks.

9. Relationship to Other Formats

mzML: MSDT is derived from mzML but does not aim to preserve full raw data fidelity.
MGF: MSDT generalizes spectrum-level representations beyond flat text formats.
Execution formats (e.g., LMDB): These may be derived from MSDT for performance reasons but are not part of the MSDT specification.
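
To make the contrast with flat text formats concrete, the following sketch parses MGF text into row dictionaries keyed by the MSDT required field names. It is not the MSDT-Converter implementation. The `parse_mgf` helper is hypothetical, and using TITLE as `scan_id` and RTINSECONDS as `retention_time` are assumptions about the mapping.

```python
from typing import Dict, List, Optional


def parse_mgf(text: str) -> List[Dict[str, object]]:
    """Parse MGF text into one dictionary per BEGIN IONS / END IONS block."""
    rows: List[Dict[str, object]] = []
    current: Optional[Dict[str, object]] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {
                "scan_id": None,
                "precursor_mz": None,
                "precursor_charge": None,
                "retention_time": None,
                "mz_array": [],
                "intensity_array": [],
            }
        elif line == "END IONS":
            if current is not None:
                rows.append(current)
            current = None
        elif current is None:
            continue  # ignore anything outside a spectrum block
        elif "=" in line:
            key, value = line.split("=", 1)
            if key == "TITLE":
                current["scan_id"] = value
            elif key == "PEPMASS":
                # PEPMASS may carry an intensity after the m/z value.
                current["precursor_mz"] = float(value.split()[0])
            elif key == "CHARGE":
                sign = -1 if value.endswith("-") else 1
                current["precursor_charge"] = sign * int(value.rstrip("+-"))
            elif key == "RTINSECONDS":
                current["retention_time"] = float(value)
        else:
            # Peak line: m/z, intensity, optionally further columns.
            mz, intensity = line.split()[:2]
            current["mz_array"].append(float(mz))
            current["intensity_array"].append(float(intensity))
    return rows
```

Rows in this shape can be validated against the required field list and then written to a typed, columnar table, which a flat MGF file cannot express directly.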

10. Intended Use and Future Extensions

MSDT is designed to be extensible. New fields may be introduced in future versions to support additional tasks or annotations, provided that backward compatibility guarantees are respected within major versions. The format is intended to serve as a stable foundation for AI-driven proteomics research and benchmarking.