Nerpa is a novel tool for discovering biosynthetic gene clusters (BGCs) of nonribosomal peptides (NRPs) by matching genome mining predictions with known chemical structures. Nerpa takes as input a set of genome sequences and a database of NRP molecules. The genomes are processed with antiSMASH v5 (Blin et al., Nucleic Acids Res. 2019) to locate putative BGCs and with NRPSPredictor2 (Röttig et al., Nucleic Acids Res. 2011) to predict the amino acid specificity of modules within the BGCs. Each NRP structure is converted into a monomer graph using rBAN (Ricart et al., J Cheminform. 2019) and then linearized if the graph topology is cyclic, branch-cyclic, or more complex. Nerpa uses a dynamic programming algorithm to align the sequence of predicted amino acids against the linearized monomer graphs of all NRPs in the database. The most likely alignment of an NRP and a BGC maximizes the Nerpa scoring function, a log-odds ratio of the probability of synthesizing the NRP using the given BGC to the probability of synthesizing it using an undefined one (the null hypothesis). The scoring function weights for matches, mismatches, and indels between various monomers and amino acid specificity predictions are learned statistically from a curated set of NRP-BGC pairs from the MIBiG database (Medema et al., Nat. Chem. Biol. 2015).
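The alignment step described above can be illustrated with a minimal sketch. This is not Nerpa's actual implementation. The weights `MATCH`, `MISMATCH`, and `INDEL` are placeholder values chosen for illustration (Nerpa learns its weights from MIBiG data), each module is reduced to a single top predicted amino acid, and a cyclic NRP is assumed to be linearized by trying every rotation of its monomer ring. All function names are hypothetical.

```python
from typing import List, Sequence

# Placeholder log-odds weights (assumption: NOT Nerpa's learned values).
MATCH = 1.0      # predicted amino acid equals the NRP monomer
MISMATCH = -1.0  # predicted amino acid differs from the NRP monomer
INDEL = -0.5     # a module or a monomer is left unpaired


def pair_score(predicted: str, monomer: str) -> float:
    """Score one predicted amino acid against one NRP monomer."""
    return MATCH if predicted == monomer else MISMATCH


def align_score(predictions: Sequence[str], monomers: Sequence[str]) -> float:
    """Global alignment score of BGC predictions vs. a linearized NRP."""
    n, m = len(predictions), len(monomers)
    # dp[i][j] = best score aligning the first i predictions to the first j monomers
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * INDEL
    for j in range(1, m + 1):
        dp[0][j] = j * INDEL
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair_score(predictions[i - 1], monomers[j - 1]),
                dp[i - 1][j] + INDEL,  # skip a predicted module
                dp[i][j - 1] + INDEL,  # skip an NRP monomer
            )
    return dp[n][m]


def linearizations(cyclic_monomers: Sequence[str]) -> List[List[str]]:
    """All rotations of a cyclic monomer ring (assumed linearization scheme)."""
    ring = list(cyclic_monomers)
    return [ring[k:] + ring[:k] for k in range(len(ring))]


def best_cyclic_score(predictions: Sequence[str], cyclic_monomers: Sequence[str]) -> float:
    """Best alignment score over all linearizations of a cyclic NRP."""
    return max(align_score(predictions, lin) for lin in linearizations(cyclic_monomers))
```

In this sketch, `best_cyclic_score(["Leu", "Orn", "Val"], ["Val", "Leu", "Orn"])` scores every rotation of the ring and keeps the highest, so the starting point chosen when writing down a cyclic structure does not affect the result. Ranking all NRPs in a database would then amount to computing this score for each BGC-NRP pair and sorting.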
Simplified Nerpa pipeline
Nerpa benchmarking on the MIBiG dataset
GARLIC is described in Dejong et al., Nat. Chem. Biol. 2016.
Note: this is an ongoing project, so stay tuned! If you want to try the Nerpa pre-release version or wish to be notified about the first public release, please write to . The Nerpa web service will be available at http://cab.cc.spbu.ru/
This work is funded by RFBR, project number 19-34-51017.