
(tentative plan, to be updated)

 

AIPC (Artificial Intelligence Proteomics Competition) series focuses on leveraging AI to uncover the ‘dark matter’ of MS-based proteomics.

 

1. Competition task

DDA rescoring optimization: Participants will develop AI models that refine peptide-spectrum matching and improve the ranking of candidate peptide sequences, drawing on machine-learning techniques and existing protein databases.

 

2. Participants

Students, researchers, bioinformaticians, computational biologists, AI practitioners, and anyone worldwide interested in applying AI to proteomics.

 

3. Sponsors

 

 

4. MSDT Dataset

Each row in the MSDT files represents a peptide-spectrum match (PSM).
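To illustrate the row-per-PSM layout, here is a minimal sketch of parsing a tab-separated PSM table with Python's standard library. The column names (`spectrum_id`, `peptide`, `charge`, `search_score`, `is_decoy`) are illustrative assumptions, not the actual MSDT schema, which will be documented with the dataset release.

```python
import csv
from io import StringIO

# Hypothetical MSDT-like PSM table (tab-separated); real column names may differ.
raw = StringIO(
    "spectrum_id\tpeptide\tcharge\tsearch_score\tis_decoy\n"
    "scan_0001\tPEPTIDEK\t2\t45.2\t0\n"
    "scan_0002\tLSVQELR\t2\t38.7\t1\n"
)
psms = list(csv.DictReader(raw, delimiter="\t"))
# Each dict is one PSM: one candidate peptide assigned to one spectrum.
print(len(psms))           # number of PSMs in the table
print(psms[0]["peptide"])  # peptide sequence of the first PSM
```

The same pattern scales to the full files by swapping `StringIO` for an open file handle (or a chunked reader for the ~300 million-PSM Professional Track).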

We plan to launch two competition tracks: the Enthusiast Track and the Professional Track, each corresponding to training datasets of different scales.

The Enthusiast Track will include approximately 16 million PSMs, while the Professional Track will consist of around 300 million PSMs.

The test set consists of approximately 3 million spectra with corresponding annotations and is shared between both competition tracks.

 

5. Baseline Model

We will provide a DDA-rescoring baseline that shows participants how to work with MSDT files and train a base model on GPUs.
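The core idea of DDA rescoring can be sketched in a few lines: each candidate PSM gets a new score from a learned combination of features, and candidates per spectrum are reranked by that score. The feature names and fixed weights below are purely illustrative (the actual baseline would learn its parameters from training data):

```python
# Toy candidate PSMs for one spectrum, with hypothetical engineered features.
candidates = [
    {"peptide": "PEPTIDEK", "search_score": 45.2, "delta_mass": 0.002, "frag_corr": 0.91},
    {"peptide": "PEPTLDEK", "search_score": 44.8, "delta_mass": 0.015, "frag_corr": 0.55},
]

# Illustrative fixed weights; a trained rescoring model would learn these.
weights = {"search_score": 0.05, "delta_mass": -10.0, "frag_corr": 2.0}

def rescore(psm):
    # Linear combination of PSM features into a single rescored value.
    return sum(weights[f] * psm[f] for f in weights)

# Rerank the candidates for this spectrum by the new score.
reranked = sorted(candidates, key=rescore, reverse=True)
print(reranked[0]["peptide"])
```

Real rescoring pipelines replace the linear scorer with a trained model (gradient boosting, SVMs, or deep networks over spectrum-level features), but the rerank step is the same.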

 

6. Evaluation Metrics

  • Identification Quantity: In a two-species evaluation scenario, the number of PSMs the model identifies correctly at a conservative FDR < 1% and a decoy FDR < 1%.
  • Evaluation Time: The model must complete processing a single file within 3 hours; otherwise, it will not be evaluated.
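For reference, the standard target-decoy FDR estimate behind metrics like these counts, at each score cutoff, the ratio of accepted decoys to accepted targets. The sketch below is a generic implementation of that idea, not the competition's official scoring script:

```python
def accepted_at_fdr(psms, alpha=0.01):
    """Count target PSMs accepted at a target-decoy FDR below alpha.

    psms: list of (score, is_decoy) tuples.
    Standard estimate: FDR ~= decoys / targets among PSMs above the cutoff.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = best = 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the largest target count whose running FDR stays below alpha.
        if targets and decoys / targets < alpha:
            best = targets
    return best

# Toy example: three high-scoring targets, then a decoy cuts off acceptance.
toy = [(50, False), (48, False), (47, False), (30, True), (25, False)]
print(accepted_at_fdr(toy))  # 3
```

The official evaluation may differ in details (e.g. how ties are broken or how the two-species constraint is enforced), but the counting logic above is the conventional starting point.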

 

7. Competition Rules

7.1 Team Formation:

Each team must have a minimum of one and a maximum of five members.

Code sharing between teams is strictly prohibited—violators will be disqualified.

7.2 Submission Rules:

Each team can submit up to two times per day, with a total submission limit of 100 within 100 days.

Submissions exceeding six hours of runtime will be considered invalid.

Invalid submissions will not count toward the total submission limit.

7.3 Ranking Rules:

A/B Leaderboard System:

  • The test dataset is split into A leaderboard (40%) and B leaderboard (60%).
  • The A leaderboard updates in real-time and displays rankings.
  • The B leaderboard (used to determine the final awards) will be revealed three days after the competition ends.

Final Ranking Criteria:

  • Score > Submission Count > Submission Time
  • If two teams have the same Score, the team with fewer submissions ranks higher.
  • If both Score and submission count are identical, the team that submitted earlier ranks higher.

Final Submission Selection:

  • Each team’s leader can designate two final submissions for the B leaderboard ranking.
  • If no selection is made, the highest-ranked A leaderboard submission will be used by default.

 

8. Anticipated Competition Duration

May – Aug 2025  (100 days)

 

9. Awards & Prizes

First Prize: $5,000 (each track)

Second Prize: $1,500 (each track)

Third Prize: $500 (each track)

 

10. Additional Benefits

Computing Credit Rewards: Every registered participant will receive a ¥100 Bohrium® computing credit.

Best Notebook Award: Participants must submit complete code using the Bohrium Notebook platform. We encourage contestants to publish relevant content in Notebook format on the Case Plaza with the tag AI4SCUP-[AIPC]. The top three notebooks with the most likes will each receive a ¥1,000 computing credit.

Internship Opportunities: Outstanding participants may be recommended for internship opportunities at relevant institutions, gaining access to top-tier research resources and networking with leading interdisciplinary experts.

Invited tours of participating institutes.