Given the increased interest in coronavirus research, we developed a coronavirus assembly mode for the SPAdes assembler (a.k.a. coronaSPAdes). It assembles full-length Coronaviridae genomes from transcriptomic and metatranscriptomic data. Algorithmically, coronaSPAdes is an HMM-guided assembler: it builds on the biosyntheticSPAdes idea of using profile HMMs specific to a gene or organism to enhance assembly.
Our preliminary results show that coronaSPAdes outperforms all other SPAdes modes and other popular assemblers in full-length coronavirus recovery.
Availability and data
You can download the coronaSPAdes package as part of the 3.15.0 release (Linux binaries, MacOS binaries, or source code). To run coronaSPAdes, either use the convenience wrapper “coronaspades.py” or pass the --custom-hmms set_of_hmms option to spades.py. coronaSPAdes comes pre-packaged with the Pfam_SARS-CoV-2_2.0 set of HMMs, augmented with a subset of HMMs from (Phan et al, 2019), and uses this set when run via coronaspades.py. These HMMs make it possible to effectively assemble different coronavirus families, as well as bafiniviruses and toroviruses. Note that coronaSPAdes is not limited to coronavirus assemblies. For HIV and influenza assemblies from the coronaSPAdes preprint we used the following HIV and influenza HMMs.
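The two ways of launching coronaSPAdes described above can be sketched as a small helper that builds the command line. This is only an illustration under stated assumptions: the helper name and the file paths are hypothetical, the input is assumed to be a paired-end library passed with the standard SPAdes options -1, -2 and -o, and the scripts are assumed to be on your PATH. Check the help output of your installed release for the exact set of options it expects.

```python
import shlex
from typing import List, Optional


def build_coronaspades_command(
    reads_1: str,
    reads_2: str,
    out_dir: str,
    custom_hmms: Optional[str] = None,
) -> List[str]:
    """Build an argument list for a coronaSPAdes run (hypothetical helper).

    Without custom_hmms, the convenience wrapper coronaspades.py is used,
    which relies on the pre-packaged HMM set. With custom_hmms, spades.py
    is called with --custom-hmms, as described in the text.
    """
    # Assumption: a single paired-end library and an output directory.
    common = ["-1", reads_1, "-2", reads_2, "-o", out_dir]
    if custom_hmms is None:
        return ["coronaspades.py"] + common
    return ["spades.py", "--custom-hmms", custom_hmms] + common


# Print the commands instead of running them; the paths are placeholders.
default_cmd = build_coronaspades_command("reads_1.fastq", "reads_2.fastq", "out")
custom_cmd = build_coronaspades_command(
    "reads_1.fastq", "reads_2.fastq", "out_custom", custom_hmms="my_hmms"
)
print(shlex.join(default_cmd))
print(shlex.join(custom_cmd))
```

To actually start a run, the returned list can be passed to subprocess.run(cmd, check=True).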
The latest preprint describing the coronaSPAdes algorithm is available here.
coronaSPAdes is based on biosyntheticSPAdes and has similar output (note that this prerelease may use somewhat misleading terminology centered around biosynthetic gene clusters). In particular, the output directory will contain:
- – set of putative virus sequences derived from HMM matches.
You may want to check them first!
- – full set of scaffolds from input data (derived without HMMs)
- – various information about HMM alignments, including the coordinates of the matches, their order, etc.
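A quick way to check the putative virus sequences first is to list the length of every record in the FASTA file. The sketch below is a minimal, dependency-free reader; the function name is hypothetical and the file name in the usage comment is a placeholder, not the actual name of the coronaSPAdes output file.

```python
from typing import Dict, Iterable


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return a mapping from sequence name to sequence length.

    The name is the first whitespace-delimited token of each header line.
    """
    lengths: Dict[str, int] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            lengths[name] += len(line)
    return lengths


# Usage with a placeholder file name (replace with the real output file):
# with open("putative_sequences.fasta") as handle:
#     for seq_name, length in fasta_lengths(handle).items():
#         print(seq_name, length)
```

Sorting the result by length makes it easy to spot candidates that are close to a full-length genome.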
Note that the basic assembler results (contigs, scaffolds, assembly graph) are constructed from the input data without using any HMMs and will therefore be the same for a given set of reads. This makes it possible to use different HMM families and extract putative sequences belonging to different species without a full reassembly: one just needs to feed the assembly graph as input and use another set of HMMs. We would be grateful for feedback on using other profile HMMs for (not only) virus assemblies.
SPAdes and coronaSPAdes are definitely capable of assembling the SARS-CoV-2 genome. In support of this, the GISAID SARS-CoV-2 database contains more than 300 SPAdes-produced assemblies. However, we believe that in the majority of cases reference-based assembly approaches should be used rather than SPAdes (or any other de novo assembler). Several studies (e.g. https://www.biorxiv.org/content/10.1101/2020.04.26.062422v2.abstract) show that de novo assemblies in the GISAID database contain more suspicious SNPs than assemblies produced by mapping-based approaches.