Given the increased interest in coronavirus research, we developed a coronavirus assembly mode for the SPAdes assembler (a.k.a. coronaSPAdes). It assembles full-length Coronaviridae genomes from transcriptomic and metatranscriptomic data. Algorithmically, coronaSPAdes is an HMM-guided assembler that builds on the biosyntheticSPAdes idea of using profile HMMs specific to a gene or organism to enhance assembly.
Our preliminary results show that coronaSPAdes outperforms all other SPAdes modes and other popular assemblers in full-length coronavirus recovery.
Availability and data
You can download the coronaSPAdes package as a part of the 3.15.3 release (Linux binaries, MacOS binaries, or source code). To run coronaSPAdes, either use the convenience wrapper “coronaspades.py”, which uses the default set of coronaviral HMMs, or pass the --custom-hmms folder_with_hmms parameter to spades.py to use your own set of HMMs.
coronaSPAdes comes pre-packaged with the Pfam_SARS-CoV-2_2.0 set of HMMs, augmented with a subset of HMMs from (Phan et al, 2019), and uses this set when run via coronaspades.py. These HMMs allow effective assembly of different coronavirus families, as well as bafiniviruses and toroviruses.
coronaSPAdes is not limited to coronavirus assemblies. For HIV and influenza assemblies from the coronaSPAdes preprint we used the following HIV and influenza HMMs.
If you want to create your own set of HMMs for HMM-guided assembly, note that assembly performance depends on the sensitivity of the chosen protein profile models. They should be universal enough to match the viral species of interest and should cover the whole viral genome, representing the majority of viral genes. At the same time, they must be specific to a particular viral family to prevent spurious matches. Databases such as Pfam, U-RVDB-prot and vFAM can be subset to create a suitable set of profile HMMs.
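As an illustration of the custom-HMM route described above, here is a minimal Python sketch that picks profiles by name out of a multi-profile HMMER3 text file, writes them into a folder, and builds the matching command line. It is a sketch under stated assumptions, not part of SPAdes: the helper names, the `selected.hmm` file name and the read paths are hypothetical. The sketch also assumes paired-end reads passed with the standard SPAdes options `-1`, `-2` and `-o`, and that the HMM folder may hold a plain-text HMMER3 file. Check the SPAdes manual for how the folder is expected to be laid out.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def select_profiles(hmm_text: str, wanted: Iterable[str]) -> List[str]:
    """Return the HMMER3 records whose NAME field is listed in `wanted`.

    In a multi-profile HMMER3 text file every record ends with a '//' line.
    """
    wanted_names = set(wanted)
    selected: List[str] = []
    current: List[str] = []
    for line in hmm_text.splitlines():
        current.append(line)
        if line.strip() != "//":
            continue
        # End of one record: look up its NAME line and decide whether to keep it.
        name = None
        for record_line in current:
            fields = record_line.split()
            if len(fields) >= 2 and fields[0] == "NAME":
                name = fields[1]
                break
        if name in wanted_names:
            selected.append("\n".join(current) + "\n")
        current = []
    return selected


def write_hmm_folder(records: List[str], folder: Path,
                     filename: str = "selected.hmm") -> Path:
    """Write the selected records into `folder` (file name is an assumption)."""
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / filename
    target.write_text("".join(records))
    return target


def build_command(reads_1: str, reads_2: str, out_dir: str,
                  custom_hmms: Optional[str] = None) -> List[str]:
    """Build the argument list for an HMM-guided run.

    Without `custom_hmms` the coronaspades.py wrapper (default HMM set) is
    used; with it, spades.py is called with --custom-hmms.
    """
    script = "coronaspades.py" if custom_hmms is None else "spades.py"
    command = [script, "-1", reads_1, "-2", reads_2, "-o", out_dir]
    if custom_hmms is not None:
        command += ["--custom-hmms", custom_hmms]
    return command
```

The returned list can be handed to `subprocess.run(command, check=True)` once SPAdes is on the `PATH`. Selecting by name is only one option; HMMER's own `hmmfetch` tool does the same job from the command line.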
The latest preprint describing the coronaSPAdes algorithm is available here.
The output of coronaSPAdes is similar to that of biosyntheticSPAdes. In particular, the output directory will contain:
- – set of putative virus sequences derived from HMM matches.
You may want to check them first!
- – full set of scaffolds from input data (derived without HMMs)
- – various information about HMM alignments, including the coordinates of the matches, their order, etc.
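For a first look at the putative virus sequences, a short script that lists them by length is often enough, since a full-length coronavirus genome is roughly 26 to 32 kb. The sketch below is only an illustration: the function names are hypothetical, and the path to the FASTA file in the output directory is left for you to supply.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}; the leading '>' is dropped."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


def lengths_longest_first(records: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (header, length) pairs sorted from longest to shortest."""
    pairs = [(name, len(seq)) for name, seq in records.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Typical use is `with open(path) as handle: print(lengths_longest_first(read_fasta(handle)))`, where `path` points at the file of putative virus sequences.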
Note that the basic assembler results (contigs, scaffolds, assembly graph) are constructed from the input data without using any HMMs and will therefore be the same for a given set of reads. It is thus possible to use different HMM families and extract putative sequences belonging to different species without a full reassembly: one just needs to feed the assembly graph as input and use another set of HMMs. We would be grateful for feedback on using other profile HMMs for (not only) virus assemblies.
SPAdes and coronaSPAdes are definitely capable of assembling the SARS-CoV-2 genome. To support this claim, we note that the GISAID SARS-CoV-2 database contains more than 300 SPAdes-produced assemblies. However, we believe that in the majority of cases reference-based assembly approaches should be used rather than SPAdes (or any other de novo assembler). Several studies (e.g. https://www.biorxiv.org/content/10.1101/2020.04.26.062422v2.abstract) show that de novo assemblies in the GISAID database have more suspicious SNPs than those produced by mapping-based approaches.