A recent paper Souvorov et al. presented a new SKESA assembler, which is claimed to be highly efficient and accurate for assembling microbial isolates. Their results have motivated us to reconsider the default SPAdes behavior for performing the most “basic” of assemblies — assembling microbial isolates. Competitors often show us our weak points and countless improvements in our pipelines were inspired by the solutions pioneered by the authors of IDBA-UD, Platanus, Megahit, Trinity and many other excellent assembly tools.
The goal of this post is two-fold: 1) discussion of the benchmarking performed by Souvorov et al. from the point of view of modern best practices in comparing genome assemblies (along with the hard-to-avoid trade-offs between the accuracy and contiguity assembly metrics); 2) presenting the benchmarking results for the SPAdes configuration that is going to be default for assembling microbial isolates starting from the next release.
Benchmarks used in Souvorov et al., 2018
Overview
Evaluating de novo assemblies is a non-trivial problem with many nuances and pitfalls. It is especially true for some special cases, e.g. metagenomic assembly, where the evaluation methods are not very well developed and even the very notion of what constitutes a good assembly can be ambiguous and application-oriented. Fortunately, in the past years the scientific community was able to resolve many of the challenges associated with benchmarking tools for de novo assembly of microbial isolates.
The overall benchmarking design in Souvorov et al. is well-designed, considers a large set of diverse datasets and uses up to date versions of the software. These benchmarks result in the following key statements:
- SKESA generates assemblies with higher N50 than SPAdes on average (195 kbp vs 132 kbp);
- SKESA assemblies have significantly lower number of mismatches per 100 kbp than SPAdes on average (0.40 vs 3.21);
- SKESA is significantly faster on average than SPAdes.
Below we discuss these statements and provide some insights on SPAdes performance.
Controversial usage of N50 metrics
The benchmarks show that SPAdes assemblies are 20% larger on average than SKESA ones. The main reason for this is the low-coverage contamination that is removed by SKESA and assembled by SPAdes. There is no universal solution to such contamination — while in some projects such contamination should definitely be thrown out, in some cases one can be interested in everything that was sequenced. In fact, contamination does not decrease quality of SPAdes assemblies for the majority of species.
However, it is certainly unfair to compare assemblies of different lengths solely by N50. Adding contaminant contigs to the assembly may decrease N50 without deterioration of the its contiguity. Since in these benchmarks reference genomes are available, we suggest to use NG50 / NGA50 as a more just alternative.
Mismatch and indel rates vs contiguity trade-off
In case of microbial assembly from Illumina reads of reasonable coverage and quality most of mismatches/indels are caused by inexact repeats (see Figure 1 and Table 1) rather than sequencing errors. Therefore, since resolving repeats leads to an elevated number of mismatches and indels, but unresolved repeats increase the number of contigs, there exists a kind of trade-off between contiguity and correctness.
Figure 1. (a) Two inexact instances of the same repeat (R’ and R’’) are present in the genome. (b) During assembly graph simplification repeats are collapsed and form a consensus repeat sequence R. (c) After repeat is resolved, the resulting contigs contain the same sequence R, which contains mismatches and indels.
Metric/Assembly | Before repeat resolution | After repeat resolution |
NGA50, bp | 81632 | 185951 |
Assembly length, bp | 4826996 | 4859913 |
Mismatches per 100 kbp | 1.34 | 3.08 |
Indels per 100 kbps | 0.62 | 0.92 |
Table 1. Comparison of SPAdes contigs before and after repeat resolution generated from quality-trimmed reads. Increase in mismatch and indel rates indicates that most errors are caused by inexact repeats.
For example, for a 5 Mbp bacterial genome 1 mismatch per 100 kbp means 50 mismatches for the entire assembly. Thus, at most 50 genes out of ~5000 may contain an error. However, our observations reveal that mismatches are typically grouped inside different instances of inexact repeats, which may decrease the number of genes containing errors. On the other hand, deteriorated assembly contiguity may result in fragmented or entirely absent genes. Moreover, a couple mismatches usually do not prevent gene prediction tool from detecting a gene and correctly determining its function, while a contig break does. So, when speaking about low error rates (less than ~5 mismatches per 100 kbp), we prefer contiguity over correctness.
Launching assembly on trimmed vs untrimmed reads
From the manuscript one can see that SKESA pipeline includes both genome assembler and read/adapter trimming tool. Adapter trimming is a useful step for genome assembly if applied wisely. Once adapters are detected (e.g. using FastQC), one should trim their reads with their favorite adapter trimming tool (we prefer Trimmomatic or cutadapt, but there are other alternatives). Although SPAdes does include a read error correction tool called BayesHammer, it is not designed for adapter and read trimming, Thus, we believe that to perform a more fair comparison of two assemblers, one should run them using the same set of input reads.
SPAdes options selection
According to the text, the authors launched SPAdes with --only-assembler
option on the dataset where only single reads were available. However, this option just turns off internal sequencing error correction module BayesHammer and is not related to library type or scaffolding procedures. Thus, such parameter selection looks rather odd.
Running time
Souvorov et al. also shows a performance comparison on these datasets. While SKESA undoubtedly is a very fast assembler, we decided to reproduce the results provided in the paper. Our tests on the same datasets from the paper showed that SKESA is two times faster than SPAdes (with default settings) on average vs ~3,5 times stated in the paper. Such difference may be caused by the fact that the running time could easily be dominated by I/O time especially on the shared systems. We’re often seeing large difference in running times of similar assembly jobs simple because the I/O subsystem was overloaded at the same time by other tasks running on the same machine.
Updated SPAdes configuration for microbial isolates
Despite the critics above, we believe that SKESA is definitely a decent assembler for isolated bacterial genomes. Even more, it helped us to understand that current SPAdes default parameters for assembling microbial isolates were sub-optimal. Below we will provide several recommendations for isolate assembly using current SPAdes 3.13. These recommendations will be included as a settings preset in the next release.
We performed benchmarking of SKESA 2.3.0 and SPAdes 3.13 using datasets provided in SKESA study (“benchmark set”). The dataset contains 403 bacterial samples from SRA. To assess the assembly quality we used QUAST 5.0.0. We excluded 4 datasets where both tools reported > 10 mismatches per 100kb (SRR5413268, SRR2814770, SRR2820671, SRR5866647). For all the metrics reported by QUAST we computed the average value among all remaining datasets. Note, that average number of genes was computed only for 331 datasets, for which the genome annotation is available. Since SKESA does not provide scaffolds, we used SPAdes contigs in the comparison.
In our benchmark SPAdes was launched in three different ways:
- Default: default options, no read pre-processing;
- Trim: Trimmomatic was used for quality and adapter trimming, internal SPAdes read correction was switched off via
--only-assembler
option; - Trim-careful: same as trim, but with internal mismatch correction was turned on using
--careful
option.
We used Trimmomatic 0.38 with options ILLUMINACLIP:all_paired.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36 where all_paired.fa is a file with all adapters for paired reads provided with Trimmomatic concatenated.
Metric/Assembly | SKESA (trimmed reads) | SPAdes default | SPAdes trim | SPAdes
trim-careful |
N50, bp (“incorrect” metric because of different assembly lengths) | 197486 | 162496 | 170984 | 193024 |
NGA50, bp (“correct” contiguity metric) | 189968 | 178825 | 185951 | 211766 |
Assembly length, bp | 4178726 | 5019481 | 4859913 | 4858562 |
# misassemblies | 1.14 | 1.75 | 1.67 | 1.79 |
Mismatches per 100 kbp | 0.46 | 3.02 | 3.08 | 1.34 |
Indels per 100 kbps | 0.62 | 0.91 | 0.92 | 0.82 |
# complete genes | 4102 | 4108 | 4111 | 4118 |
Table 2. Comparison between SKESA and SPAdes with different options.
The table shows that SPAdes using default settings is slightly worse than SKESA in the assembly contiguity, but with new settings and adapter trimming SPAdes shows better results. We also confirm that SKESA is more accurate on average, but this advantage is compensated by the larger number of complete genes obtained in all SPAdes runs. In addition, it is worth noting that if SPAdes is launched in only-assembler mode using trimmed reads, running times of SPAdes and SKESA become rather similar.
So, here come the recommendations for SPAdes for assembly of isolated bacterial genomes sequenced with reasonable coverage(~100x and more):
- Use Trimmomatic (or your favorite read trimming tool) to remove adapters and low quality fragments prior to the assembly. We used the options mentioned above, but we believe that any quality & adapter trimmer with reasonable settings will work just fine.
- Do not use SPAdes internal read-correction procedure, i.e. run with
--only-assembler
option. - Run SPAdes in careful mode (use
--careful
option), which can improve the accuracy in cost of some running time.