Genome Assembly

We originally developed SPAdes, a de novo genome assembler, to overcome the complications associated with single-cell microbial data generated using MDA. SPAdes was later recognized by the scientific community as one of the best assemblers for bacterial data sets. This inspired us to extend SPAdes to sequencing platforms other than Illumina (e.g. IonTorrent, PacBio and Oxford Nanopore) and to develop a set of novel software tools for further purposes, including assembly of highly polymorphic genomes, plasmid assembly and metagenome assembly.


The analysis of concentrations of circulating antibodies in serum (the antibody repertoire) is a fundamental problem in immunoinformatics. Antibodies are not directly encoded in the germline but are extensively diversified by somatic gene recombination and mutations. These processes complicate reconstruction of the original repertoire from sequencing data. In addition, the antibody repertoire is unique to each individual, which results in a lack of gold-standard test datasets. To address these challenges, we are developing a novel toolkit for working with sequencing data, including algorithms for antibody repertoire construction, instruments for clonal and statistical analysis, and quality assessment and simulation tools.

Antibiotics Discovery

Starting from penicillin, Natural Products have had an exceptional track record in pharmacology: many antibiotics, antiviral and antitumor agents, immunosuppressors, and toxins are Natural Products. The discovery of teixobactin in early 2015 brought Natural Products back to the center of attention after a long recession in antibiotic discovery efforts. The launch of the Global Natural Products Social (GNPS) Molecular Networking project, also in 2015, brought together more than a billion mass spectra of natural products generated in over a hundred laboratories around the globe. While these spectra certainly contain new Natural Products, including some of great medical value, identifying them remains a challenging computational problem. Natural Products often contain non-standard amino acids and complex modifications, which greatly complicates their discovery.
The Center for Algorithmic Biotechnology, in collaboration with the Center for Computational Mass Spectrometry at UCSD, is working on software for sequencing and dereplication of cyclic and more complex peptides. This software has been used successfully in many collaborative projects.

Cancer Genomics

Advanced sequencing techniques have changed our understanding of how tumours develop and differentiate. NGS technologies have made it possible to run trials in which treatment options are guided by genomic characterization of patients' tumours. However, interpreting the many events seen in tumour genomes remains a key challenge. Furthermore, cancer evolves under the selection pressure of drug treatment, which leads to very rare or unique somatic events that need to be captured.
Two years of collaboration between the Lab and AstraZeneca resulted in several pipelines for analyzing cancer genome data, including a tool for capture panel evaluation (TargQC), a mutation prioritization and interpretation framework, a sequencing data pre-analysis QC tool (PreQC), and a clinical reporting framework. These tools were used extensively in production and in scientific studies, and enabled the discovery of TAGRISSO, a lung cancer drug that was recently approved by the FDA.


High-throughput metagenomic sequencing has become one of the most effective ways to study microbial communities sampled from the environment, as well as from living organisms. Our group is developing metaSPAdes, software for de novo assembly of metagenomic samples, as well as a novel pipeline for the analysis of series of metagenomic samples.

TSLR Analysis

Illumina has recently introduced TSLR technology, which produces virtual long reads (up to 10 kb in length) derived from barcoded pools of short reads and promises to reduce sequencing cost compared to SMRT technology. TSLR technology is based on fragmentation of genomic DNA into large segments (~10 kb long) and subsequent formation of random pools of the resulting segments (each pool contains ~300 segments). The segments in each pool are clonally amplified, sheared, marked with a barcode that is unique to the pool, and sequenced as standard Illumina short reads. All short reads originating from the same barcoded pool are assembled together, resulting in a set of long contiguous sequences (contigs).
The unique sequencing pipeline of TSLR technology raises many computational challenges. This project is devoted to the development of algorithms for efficient analysis of TSLR data, including: 1) barcode assembly, 2) metagenome assembly from TSLRs, and 3) structural variation detection in the human genome using TSLRs.
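
As a rough illustration of the pooling scheme described above, the sketch below groups short reads by their pool barcode, which is the step that has to happen before the reads of each pool can be assembled together. It is a minimal sketch, not the project's software: the function names, the (barcode, sequence) input format and the toy reads are assumptions made for this example, and the default pool parameters simply reuse the approximate figures quoted above (~300 segments of ~10 kb per pool).

```python
from collections import defaultdict


def group_reads_by_barcode(reads):
    """Collect short reads into per-pool lists keyed by barcode.

    `reads` is an iterable of (barcode, sequence) pairs; this input
    format is an assumption made for the example.
    """
    pools = defaultdict(list)
    for barcode, sequence in reads:
        pools[barcode].append(sequence)
    return dict(pools)


def approximate_pool_span(segments_per_pool=300, segment_length=10_000):
    """Approximate total bases of genomic DNA represented in one pool.

    Defaults follow the approximate figures in the text
    (~300 segments per pool, ~10 kb per segment).
    """
    return segments_per_pool * segment_length


# Toy input: three reads drawn from two hypothetical barcoded pools.
reads = [
    ("BC01", "ACGTAC"),
    ("BC02", "TTGACA"),
    ("BC01", "GTACGG"),
]
pools = group_reads_by_barcode(reads)
pool_span = approximate_pool_span()
```

With the figures above, each pool represents roughly 3 Mb of genomic DNA, so the reads of a single pool can be assembled separately, one pool at a time, rather than all at once.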


RNA-Seq is widely used for well-studied organisms such as mouse and human, where reference-based methods can be applied for the analysis. However, many research projects study organisms with previously unsequenced genomes, which creates a need for a de novo transcriptome assembler. Because expression levels vary between genes and isoforms, RNA-Seq data sets are characterized by highly uneven coverage depth. Since the SPAdes assembler is already capable of dealing with non-uniform coverage (typical for single-cell genomic data), we decided to create rnaSPAdes, a SPAdes-based assembler for RNA-Seq data.
In addition, we complement it with rnaQUAST, a quality assessment tool for transcriptome assemblers that works both for model organisms with a reference genome and gene database and for organisms whose genome is unknown.


Our research in computational proteomics lies mainly in the area of top-down mass spectrometry, a novel and highly promising technology for acquiring mass spectra. In contrast to the traditional bottom-up approach, it does not require protein digestion prior to the tandem mass spectrometry step. Analysis of intact proteins offers certain advantages, such as the possibility to detect post-translational modifications in a coordinated fashion and to identify multiple protein species.