SPAdes for novel technologies

Adaptation of SPAdes software to usage of novel technologies applied in analysis and assembly of genomic, metagenomic and transcriptomic data
RSF #19-14-00172

Anton Korobeynikov
Dmitry Antipov
Alla Lapidus
Andrey Prjibelski
Alexander Shlemov
Elena Bushmanova
Tatiana Dvorkina
Olga Kunyavskaya
Ivan Tolstoganov

In this project we plan to significantly enhance the functionality of the SPAdes genome assembler and its derivative algorithms: hybridSPAdes, metaSPAdes and rnaSPAdes. The key idea is to integrate support for novel sequencing technologies into the existing pipelines for short-read genome, metagenome and transcriptome assembly.

Modern transcriptome assembly methods can reconstruct only partial sequences of complex alternatively spliced isoforms. At the same time, expression levels can in most cases be estimated only for entire genes, not for individual isoforms. De novo assembly becomes even more challenging when the sample contains molecules from multiple organisms or even large bacterial communities. Even though prokaryotic genes have a simpler structure than eukaryotic ones, the presence of related strains in the communities, as well as homologous genes and common repetitive sequences in the genomes of different species, complicates the reconstruction of full-length transcripts and genomic sequences from short-read data.

In this project we propose to improve existing computational methods and implement novel ones for metatranscriptome assembly, as well as for hybrid metagenome and RNA-Seq assembly, which involves processing data obtained with various sequencing technologies. Modern assembly algorithms use genome sequence graphs (assembly graphs) as one of the representations of a genome assembly. Effective and accurate solutions for aligning nucleotide and amino acid sequences onto such graphs will be useful in various applications, such as hybrid assembly, long-read error correction and haplotype separation. For example, accurate haplotype separation may be useful for investigating allele-specific gene expression levels and studying compound heterozygosity. An important generalization of the sequence-to-graph alignment problem is the alignment of probabilistic models of gene families, represented by hidden Markov models (HMMs). In other words, the result of such an alignment is a set of paths that correspond to the given model with high probability. A solution to this problem will open a conceptually new way of finding genes (including novel genes or variants of existing ones) in the assembly, and will make it possible to improve and verify the assembly itself using the discovered gene paths. This approach is particularly important for metagenome analysis, since de novo methods often fail to assemble genes of interest.
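To make the sequence-to-graph alignment problem concrete, the following is a minimal illustrative sketch, not the algorithm used in SPAdes. It assumes a strongly simplified setting: an acyclic graph with one character per node, nodes numbered in topological order, and unit edit costs. The function name `align_to_graph` and its parameters are hypothetical. Real assembly graphs contain cycles and long edge labels, which this sketch does not handle.

```python
def align_to_graph(query, labels, edges):
    """Minimum edit distance between `query` and the best path in a DAG.

    Simplifying assumptions (hypothetical toy model, not SPAdes code):
      * labels[v] is the single character stored at node v;
      * nodes are numbered in topological order, so every edge (u, v) has u < v;
      * the path may start and end at any node, the query is aligned in full.
    """
    m = len(query)
    preds = [[] for _ in labels]
    for u, v in edges:
        if not u < v:
            raise ValueError("edges must respect the topological numbering")
        preds[v].append(u)

    # Virtual source: an empty path, with the first j query characters unmatched.
    start = list(range(m + 1))
    rows = []   # rows[v][j]: best cost of aligning query[:j] to a path ending at v
    best = m    # cost of aligning the whole query to the empty path

    for v, ch in enumerate(labels):
        # Cheapest state just before node v, over the source and all predecessors.
        before = start[:]
        for u in preds[v]:
            before = [min(a, b) for a, b in zip(before, rows[u])]

        row = [before[0] + 1]  # node character left unmatched
        for j in range(1, m + 1):
            row.append(min(
                before[j - 1] + (ch != query[j - 1]),  # match or substitution
                before[j] + 1,                         # graph character unmatched
                row[j - 1] + 1,                        # query character unmatched
            ))
        rows.append(row)
        best = min(best, row[m])
    return best


# A small "bubble": A -> (C | T) -> G, spelling the paths ACG and ATG.
labels = ["A", "C", "T", "G"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
distance = align_to_graph("ATG", labels, edges)
```

Aligning an HMM instead of a single sequence replaces the per-character match test with emission and transition scores of the model, but the traversal of the graph follows the same pattern.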

One of the key applications of the HMM-to-graph alignment problem is the detection of antibiotic resistance genes. In this project we aim to design and implement algorithms for aligning long error-prone DNA reads to assembly graphs, as well as nucleotide and amino acid HMMs that carry information about entire gene families.

Long-read sequencing technologies have produced significantly better genome assemblies than short-read sequencing. However, they remain either expensive in per-base cost or complex in sample preparation. In contrast, recently developed synthetic long read (SLR) technologies (from Illumina, 10X Genomics and other companies) combine the accuracy and low cost of short reads with long-range information, making them an attractive alternative to error-prone long reads. In this project we plan to develop a novel algorithm for genome assembly and metagenomic binning using Chromium SLRs.

Genomic repeats have been one of the main stumbling blocks in de novo genome assembly throughout its entire history. Multiple biotechnological and algorithmic advances, such as long-range mate-pairs, long reads and new assembly tools, have significantly improved the quality of assembled sequences. However, finishing complex genomes that contain long repeats still requires additional information and manual analysis. To address this common problem we propose to exploit recently emerged high-resolution optical and electronic genomic maps. Existing methods produce accurate maps at high throughput and relatively low cost, which makes them widely applicable. In this project we plan to develop algorithms for hybrid assembly of short reads and genomic maps, which will make it possible to resolve complex repeats and obtain accurate and contiguous genomic sequences.
In addition, we will demonstrate the advantages of genomic maps by comparing the quality of the resulting assemblies for various species with assemblies produced by state-of-the-art assemblers that use alternative long-range sequencing technologies, such as mate-pairs, PacBio and Oxford Nanopore.

Chromosome conformation capture (3C) comprises a series of technologies that detect genomic loci that are close in 3-D space but far apart in the genome. Recent developments make it possible to apply 3C-based technologies to genome scaffolding and metagenomic binning. Since 3C-based technologies are usually used with short-read sequencing, it is natural to integrate 3C data into the SPAdes short-read assembler. New algorithms developed for 3C-data assembly will significantly improve assembly quality in terms of completeness and contiguity of the reconstructed individual genomes, which will open new possibilities for further metagenomic analysis.
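As a rough illustration of how 3C-style read pairs can drive metagenomic binning, the sketch below groups contigs that share enough read-pair links. It is a toy model under stated assumptions, not the planned SPAdes algorithm: the function name `bin_contigs_by_links`, the input format (one pair of contig names per mapped read pair) and the `min_links` threshold are all hypothetical. Practical tools also normalize link counts, for example by contig length, which is omitted here.

```python
from collections import Counter


def bin_contigs_by_links(read_pairs, min_links=2):
    """Group contigs connected by at least `min_links` read pairs.

    `read_pairs` is an iterable of (contig_a, contig_b) tuples, one per
    read pair whose two ends were mapped to those contigs (hypothetical format).
    """
    links = Counter()
    contigs = set()
    for a, b in read_pairs:
        contigs.update((a, b))
        if a != b:  # pairs inside one contig carry no binning signal
            links[frozenset((a, b))] += 1

    # Union-find over contigs: merge two contigs when their link count is high enough.
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for pair, count in links.items():
        if count >= min_links:
            a, b = pair
            parent[find(a)] = find(b)

    bins = {}
    for c in contigs:
        bins.setdefault(find(c), set()).add(c)
    return sorted(sorted(members) for members in bins.values())


pairs = [("c1", "c2"), ("c1", "c2"), ("c2", "c3"),
         ("c4", "c5"), ("c5", "c4"), ("c3", "c3")]
bins = bin_contigs_by_links(pairs, min_links=2)
```

In this example the single link between c2 and c3 falls below the threshold, so c3 stays in its own bin while c1 and c2, and c4 and c5, are grouped together.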