NPS

NPS is an approach to scoring (NPScore) and evaluating the statistical significance (NPSignificance) of peptidic natural product-spectrum matches. It is embedded into the Dereplicator and VarQuest pipelines.

The method takes into account the intensities of MS/MS peaks and the occurrence of various additional ions during fragmentation in mass spectrometers. The weights for scoring annotated and missed peaks are statistically learned. NPS replaces the naive Shared Peak Count (SPC) scoring method, which was the previous baseline.
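For reference, the idea behind the SPC baseline can be sketched in a few lines of Python. This is a minimal illustration and not the Dereplicator implementation: the function name `shared_peak_count`, the 0.02 Da tolerance, and the sample masses are all assumptions made for the example. The sketch counts theoretical fragment masses that have at least one experimental peak within the tolerance and ignores peak intensities, which is exactly the information NPS additionally exploits.

```python
from bisect import bisect_left


def shared_peak_count(theoretical_mz, experimental_mz, tol=0.02):
    """Count theoretical fragment m/z values matched by an experimental peak.

    tol is an absolute tolerance in Da (a hypothetical default, not taken
    from Dereplicator). Peak intensities are deliberately ignored.
    """
    peaks = sorted(experimental_mz)
    count = 0
    for mz in theoretical_mz:
        # First experimental peak that is not below the lower tolerance bound.
        i = bisect_left(peaks, mz - tol)
        if i < len(peaks) and peaks[i] <= mz + tol:
            count += 1
    return count


# Made-up masses for illustration only.
theoretical = [100.00, 200.00, 300.00, 400.00]
experimental = [100.01, 250.00, 300.015, 399.90]
spc = shared_peak_count(theoretical, experimental)
```

Here two of the four theoretical masses (100.00 and 300.00) have an experimental peak within 0.02 Da, so the score is 2 regardless of how intense those peaks are.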

We compared the performance of the default Dereplicator/VarQuest (SPC-based) with the new versions (NPS-based) on various datasets including both regular linear peptides (as a sanity check) and peptidic natural products (PNPs, the target type of data).

The NPS paper was accepted to the ISMB/ECCB 2019 proceedings track (CompMS COSI) and was presented on July 23 in Basel. The paper is published in a special ISMB/ECCB issue of the journal Bioinformatics. You may also be interested in the corresponding slides (also presented at the BiATA 2019 conference) and poster (J-09 on July 23).

Benchmarking results

Proteomics data

The first benchmarking dataset is a subset of the Human Proteome Map project (Kim et al., 2014). The full dataset is freely accessible on GNPS (accession MSV000079514) and contains approximately 25 million high-resolution MS/MS spectra. These spectra were obtained on LTQ-Orbitrap Velos and LTQ-Orbitrap Elite mass spectrometers from proteins of 30 organ tissues. We trained the model on spectra from heart (426,086 spectra) and tested it on spectra from kidney (439,253 spectra). The target peptide database was obtained from the Human RefSeq proteins (Pruitt et al., 2005) and contains 47,284 peptides with sequence lengths from 8 to 20 amino acids.

Dereplicator results in two modes (NPS and SPC) were also compared against one of the state-of-the-art proteomics tools, MS-GF+ (Kim and Pevzner, 2014).

The figure demonstrates the number of identified Peptide-Spectrum Matches (PSMs) at a given False Discovery Rate level (% FDR). FDR is estimated using the target-decoy approach (Elias and Gygi, 2007).
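The target-decoy estimate behind such curves can be sketched as follows. This is an illustrative version under stated assumptions, not the code used by Dereplicator: it uses one common estimator, FDR = (number of decoy hits) / (number of target hits) above a score cutoff, and the name `psms_at_fdr` and the toy scores are hypothetical. The exact estimator variant used in the benchmarks is not specified on this page.

```python
def psms_at_fdr(psms, max_fdr):
    """Return the number of target PSMs accepted at the given FDR level.

    psms: iterable of (score, is_decoy) pairs, where a higher score means
    a better match. The FDR at a cutoff is estimated as decoys / targets
    among all PSMs scoring at or above that cutoff.
    """
    targets = decoys = best = 0
    for _score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the largest target count whose estimated FDR is acceptable.
        if targets > 0 and decoys / targets <= max_fdr:
            best = targets
    return best


# Toy search results: (score, is_decoy). Values are made up.
toy = [(9.1, False), (8.7, False), (8.0, False), (7.5, True),
       (7.2, False), (6.9, False), (6.0, True), (5.5, False)]

strict = psms_at_fdr(toy, 0.0)    # only PSMs ranked above the first decoy
relaxed = psms_at_fdr(toy, 0.25)  # tolerates one decoy per four targets
```

Sweeping `max_fdr` over a range of values and plotting the returned counts gives a curve of the same kind as the one in the figure.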

The NPS method identified 6% more PSMs than the baseline SPC approach at a strict 1% FDR (100,444 vs 94,663 PSMs). The small improvement on this dataset may be due to the relative simplicity of peptide identification from high-quality data. MS-GF+ outperformed both Dereplicator approaches at all FDR levels. However, beating one of the leading proteomics tools on its own ground is outside the scope of NPS development. The impressive MS-GF+ results partly rely on additional peptide-specific techniques, such as comparing distances between experimental peaks with the known exact masses of the 20 proteinogenic amino acids (Kim and Pevzner, 2014). Since NPS is designed for much more chemically diverse PNP structures, it cannot rely on such assumptions and is therefore expected to lose to MS-GF+ and other leading proteomics tools on regular peptide datasets.

PNP data (standard identification with Dereplicator)

We created the main natural product test dataset by combining 13 high-resolution GNPS spectral datasets (MSV000078568, 78604, 78606, 78635, 78787, 78803, 78817, 78839, 78936, 78937, 79098, 79450, 80102). The resulting dataset consists of ∼16 million high-resolution spectra. The target chemical database is PNPdatabase from Gurevich et al., 2018. The database consists of 5,021 compounds (1,582 PNP families) from AntiMarin (Blunt et al., 2007), DNP (Gozalbes and Pineda-Lucena, 2011), MIBiG (Medema et al., 2015), and StreptomeDB (Lucas et al., 2013) databases.

The figure demonstrates the number of identified PNP-Spectrum Matches (PSMs) at a given False Discovery Rate level (% FDR). FDR is estimated using the target-decoy approach (Elias and Gygi, 2007).

NPS shows a more than 45% increase in the number of PSMs compared to SPC at FDR 1% (10,287 vs 6,972).
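The quoted improvements follow directly from the PSM counts reported on this page. The short check below recomputes them; the helper name `relative_gain` is introduced only for this illustration.

```python
def relative_gain(new, baseline):
    """Relative increase of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# PSM counts at 1% FDR, as reported in the text (NPS vs SPC).
proteomics_gain = relative_gain(100_444, 94_663)  # roughly 6%
pnp_gain = relative_gain(10_287, 6_972)           # roughly 47.5%
```

The gain on PNP data is thus about eight times larger than on regular peptides, consistent with PNPs being the target type of data for NPS.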

PNP data (variable identification with VarQuest)

Since the VarQuest pipeline is considered less robust than Dereplicator, we tested it on three extensively studied GNPS datasets and rigorously validated the output. These three datasets are MSV000079450 (∼400,000 spectra from Pseudomonas isolates (Nguyen et al., 2016; Gurevich et al., 2018)), MSV000078604 (∼200,000 spectra from Streptomyces (Mohimani et al., 2014; Gurevich et al., 2018)), and MSV000078839 (∼500,000 spectra from Streptomyces (Duncan et al., 2015; Mohimani et al., 2017; Gurevich et al., 2018)). The target chemical database is PNPdatabase, as in the benchmarking above.

The figure demonstrates the number of identified PNP-Spectrum Matches (PSMs) at a given False Discovery Rate level (% FDR). FDR is estimated using the target-decoy approach (Elias and Gygi, 2007).

NPS-based VarQuest identified significantly more PSMs than all other considered methods at all FDR levels. While SPC-based VarQuest showed less accurate results than the Dereplicator methods, the NPS-powered version of VarQuest outperformed all competitors even at the strictest 0% FDR level.

Training PNP dataset

To create an appropriately sized training dataset, we processed ∼130 million GNPS mass spectra (list of 120 dataset accession numbers) against PNPdatabase with Dereplicator v.2.0 and curated the most reliable PNP annotations. Until recently, such a high-quality training dataset was nearly impossible to obtain for PNPs, so NPS is, to our knowledge, the first high-throughput PNP identification method that uses a statistically learned scoring model. The created dataset is freely available here and can be used by other researchers in future studies. The file includes paths to spectra files (starting from the GNPS dataset IDs), scan numbers of specific spectra inside the files, and IDs of the corresponding PNPs from the database.

The initial set of Dereplicator annotations includes 14,757 PSMs corresponding to 420 unique PNPs (only hits with P-values below 1e-10 are present). To obtain a training set of reasonable size and quality, we further considered all identifications of charge +1 and +2 at the 5% FDR level and kept up to 5 best PSMs per compound. The resulting dataset contains 2,213 PSMs. The composition of the dataset by charge (+1, +2), structure (linear, cyclic, complex), and Dereplicator P-value is depicted below.
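The per-compound selection step can be sketched as below. This is a hypothetical illustration rather than the script actually used: the record keys (`compound`, `charge`, `p_value`) and the function name `select_training_psms` are assumptions, "best" is assumed to mean lowest P-value, and the FDR filtering is assumed to have been applied beforehand.

```python
from collections import defaultdict


def select_training_psms(psms, max_per_compound=5, charges=(1, 2)):
    """Keep up to `max_per_compound` lowest-P-value PSMs per compound.

    psms: iterable of dicts with hypothetical keys 'compound', 'charge'
    and 'p_value'. PSMs with a charge outside `charges` are dropped.
    FDR filtering is assumed to have happened upstream.
    """
    by_compound = defaultdict(list)
    for psm in psms:
        if psm["charge"] in charges:
            by_compound[psm["compound"]].append(psm)

    selected = []
    for hits in by_compound.values():
        # Lower P-value means a more reliable match (assumed ranking).
        hits.sort(key=lambda h: h["p_value"])
        selected.extend(hits[:max_per_compound])
    return selected


# Made-up input: compound A has seven +1 hits, compound B has one +2 and one +3 hit.
toy = [{"compound": "A", "charge": 1, "p_value": 10.0 ** -(11 + i)}
       for i in range(7)]
toy += [{"compound": "B", "charge": 2, "p_value": 1e-12},
        {"compound": "B", "charge": 3, "p_value": 1e-15}]

selected = select_training_psms(toy)
```

Capping the number of PSMs per compound keeps frequently observed PNPs from dominating the training set, which is presumably the motivation for the limit of 5.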

 

Examples of identified PSMs

Below are interactive visualizations of PSMs reported by NPS-powered Dereplicator/VarQuest and missed by the SPC-based versions at the appropriate FDR level. All hits were reported in the MSV000079450 (Pseudomonas), MSV000078604 (Streptomyces №1), and MSV000078839 (Streptomyces №2) datasets described in detail above. Some of these PSMs correspond to likely contaminants (e.g. Bacillus species in Streptomyces datasets).

Dereplicator hits:

VarQuest hits:

 

Feedback

In case of any questions/suggestions/comments regarding this page, please write to .