Nerpa is a novel tool for discovering biosynthetic gene clusters (BGCs) of nonribosomal peptides (NRPs) by matching genome mining predictions with known chemical structures. Nerpa takes as input a set of genome sequences and a database of NRP molecules. The genomes are processed with antiSMASH v5 (Blin et al., Nucleic Acids Res. 2019) to locate putative BGCs and with NRPSPredictor2 (Röttig et al., Nucleic Acids Res. 2011) to predict the amino acid specificity of modules within the BGCs. Each NRP structure is converted into a monomer graph using rBAN (Ricart et al., J Cheminform. 2019) and then linearized if the graph topology is cyclic, branch-cyclic, or more complex. Nerpa uses a dynamic programming algorithm to align the sequence of predicted amino acids against the linearized monomer graphs of all NRPs in the database. The most likely alignment of an NRP and a BGC maximizes the Nerpa scoring function, a log-odds ratio of the probability that the NRP is synthesized by the given BGC to the probability that it is synthesized by an undefined one (the null hypothesis). The scoring function weights for matches, mismatches, and indels between various monomers and amino acid specificity predictions are learned statistically from a curated set of NRP-BGC pairs from the MIBiG database (Medema et al., Nat. Chem. Biol. 2015).
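The alignment step described above can be illustrated with a minimal sketch. This is not Nerpa's implementation: the three weights are made-up constants (Nerpa learns its weights from MIBiG data and makes them depend on the monomers and the specificity predictions), the function names `align_score` and `best_cyclic_score` are hypothetical, and trying every rotation covers only the purely cyclic case, not branch-cyclic or more complex topologies.

```python
import math

# Hypothetical log-odds weights, chosen only for illustration.
# Nerpa's real weights are learned from curated NRP-BGC pairs.
MATCH = math.log(4.0)      # predicted residue agrees with the monomer
MISMATCH = math.log(0.25)  # predicted residue disagrees with the monomer
INDEL = math.log(0.1)      # a module or a monomer is left unpaired


def align_score(predicted, monomers):
    """Best global alignment score of predicted residues vs. a linear NRP."""
    n, m = len(predicted), len(monomers)
    # dp[i][j] is the best score for predicted[:i] aligned to monomers[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + INDEL
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + INDEL
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = MATCH if predicted[i - 1] == monomers[j - 1] else MISMATCH
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair,  # pair a module with a monomer
                dp[i - 1][j] + INDEL,     # skip a predicted module
                dp[i][j - 1] + INDEL,     # skip an NRP monomer
            )
    return dp[n][m]


def best_cyclic_score(predicted, cyclic_monomers):
    """Score a cyclic NRP by trying each rotation as a linearization."""
    k = len(cyclic_monomers)
    rotations = [
        cyclic_monomers[s:] + cyclic_monomers[:s] for s in range(max(k, 1))
    ]
    return max(align_score(predicted, r) for r in rotations)
```

Because the weights are logarithms of probability ratios, summing them along an alignment corresponds to multiplying the underlying ratios, which is why the best alignment can be found by maximizing a sum with dynamic programming.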

Pipeline overview

Benchmarking on the MIBiG dataset

We compute the False Discovery Rate (FDR) as the ratio of the number of wrongly identified BGC-NRP pairs to the total number of identifications. GARLIC is described in Dejong et al., Nat. Chem. Biol. 2016.
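As a small worked example of this definition, the sketch below computes the FDR from a list of reported pairs and a set of known correct pairs. The function name `false_discovery_rate` and the identifiers in the usage example are hypothetical and are not taken from the benchmark itself.

```python
def false_discovery_rate(identified_pairs, true_pairs):
    """Fraction of reported (BGC, NRP) pairs that are not known to be correct."""
    identified = set(identified_pairs)
    if not identified:
        # No identifications were made, so there is nothing to be wrong about.
        return 0.0
    wrong = identified - set(true_pairs)
    return len(wrong) / len(identified)


# Hypothetical identifiers, for illustration only.
reported = [("BGC1", "NRP_a"), ("BGC2", "NRP_b"), ("BGC3", "NRP_c"), ("BGC4", "NRP_d")]
known = {("BGC1", "NRP_a"), ("BGC2", "NRP_b"), ("BGC3", "NRP_c"), ("BGC4", "NRP_x")}
fdr = false_discovery_rate(reported, known)  # 1 wrong pair out of 4 reported
```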

Trying Nerpa out

You may check out the first command-line release of Nerpa, as well as its source code, installation instructions, and test data. We are working on the Nerpa web interface. Stay tuned!


Kunyavskaya, O., Tagirdzhanov, A.M., et al. 2021. Nerpa: a tool for discovering biosynthetic gene clusters of bacterial nonribosomal peptides. Metabolites 2021, 11, 693. (open access). Supplementary data include results of Nerpa screening of the RefSeq bacterial genomes (13,399 genomes) against the database of putative NRPs (8,368 compounds) (Supplementary File S1) and the Nerpa training dataset: 64 known BGC-NRP alignments (Supplementary File S2).

Feedback, bug reports

If you have any questions or want to report a bug, please write to or post an issue on