Given the increased interest in coronavirus research, we developed a coronavirus assembly mode for the SPAdes assembler (a.k.a. coronaSPAdes). It assembles full-length Coronaviridae genomes from transcriptomic and metatranscriptomic data. Algorithmically, coronaSPAdes is an HMM-guided assembler: it builds on the biosyntheticSPAdes idea of using profile HMMs specific to a gene or organism to enhance assembly.
Our preliminary results show that coronaSPAdes outperforms all other SPAdes modes and other popular assemblers in full-length coronavirus recovery.
Availability and data
You can download the coronaSPAdes pre-release package (current version from 2020-10-04) from here. To run coronaSPAdes, either use the convenience wrapper “coronaspades.py” or pass the --custom-hmms set_of_hmms parameters to spades.py. coronaSPAdes comes with the Pfam_SARS-CoV-2_2.0 set of HMMs pre-packaged and uses this set when run via coronaspades.py. Note that these HMMs might be somewhat biased towards betacoronaviruses. For a broader range of virus families, including bafiniviruses and toroviruses, one could use the HMMs from (Phan et al, 2019), available here. Note that coronaSPAdes is not limited to coronavirus assemblies; see below for options concerning other species.
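As a rough illustration of the second option, the following Python sketch builds (but does not run) a spades.py command line that passes a custom HMM directory. Only --custom-hmms comes from the text above. The -1, -2 and -o options are the usual SPAdes paired-read and output options, and all file names, as well as the helper name build_spades_command, are hypothetical placeholders. The pre-release may need further options that are not shown here, so check the SPAdes manual before relying on it.

```python
import shlex


def build_spades_command(hmm_dir, reads_1, reads_2, out_dir, spades="spades.py"):
    """Return an argument list that passes a custom HMM set to spades.py.

    All paths are placeholders supplied by the caller; nothing is executed here.
    """
    return [
        spades,
        "--custom-hmms", str(hmm_dir),  # option named in the text above
        "-1", str(reads_1),             # forward reads (standard SPAdes option)
        "-2", str(reads_2),             # reverse reads (standard SPAdes option)
        "-o", str(out_dir),             # output directory (standard SPAdes option)
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_spades_command("my_hmms", "reads_1.fastq", "reads_2.fastq", "corona_out")
    # Print a shell-quoted command instead of running it.
    print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run(cmd, check=True) once the paths point at real data and a compiled SPAdes is on the PATH.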
Please note that there is no pre-built version of coronaSPAdes: one needs to compile it from the package above. The compilation instructions can be found in the SPAdes manual.
coronaSPAdes is based on biosyntheticSPAdes and has similar output (note that this pre-release may use somewhat misleading terminology centered around biosynthetic gene clusters). In particular, the output directory will contain:
- – set of putative virus sequences derived from HMM matches. You may want to check them first!
- – full set of scaffolds from input data (derived without HMMs)
- – various information about HMM alignments, including the coordinates of the matches, their order, etc.
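A quick first check of the putative virus sequences could look like the sketch below, which lists every record in a FASTA file by length and flags those falling in a rough full-length window. The 26,000 to 32,000 nt window is an assumption based on typical coronavirus genome sizes, not something coronaSPAdes reports, and the function names are illustrative only.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(records, min_len=26_000, max_len=32_000):
    """Return (header, length, in_window) tuples, longest sequence first.

    The length window is an assumed heuristic for a full-length
    coronavirus genome; adjust it for other target organisms.
    """
    rows = [(h, len(s), min_len <= len(s) <= max_len) for h, s in records]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

To use it, open the putative-sequence FASTA from the output directory (for example `with open(path) as fh: rows = summarize(read_fasta(fh))`, where path is supplied by you) and inspect the flagged rows first.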
Note that the basic assembler results (contigs, scaffolds, assembly graph) are constructed from the input data without using any HMMs and will therefore be the same for a given set of reads. It is thus possible to use different HMM families and extract putative sequences belonging to different species without a full reassembly: one just needs to feed the assembly graph as input and use another set of HMMs. We would be grateful for feedback on using other profile HMMs for the task of (not only) virus assembly.
SPAdes and coronaSPAdes are definitely capable of assembling the SARS-CoV-2 genome. To support this claim, we note more than 300 SPAdes-produced assemblies in the GISAID SARS-CoV-2 database. However, we believe that in the majority of cases reference-based assembly approaches should be used rather than SPAdes (or any other de novo assembler). Several studies (e.g. https://www.biorxiv.org/content/10.1101/2020.04.26.062422v2.abstract) show that de novo assemblies in the GISAID database contain more suspicious SNPs than assemblies produced by mapping-based approaches.