Fork me on GitHub

IgRepertoireConstructor

IgRepertoireConstructor 3.0.0 is released!

Includes new tools BarcodedIgReC and IgDiversityAnalyzer, major update of IgQUAST (former aimQUAST), and various speedups and improvements.

Download IgRepertoireConstructor source code (zip)

Download IgRepertoireConstructor source code (tar.gz)

Motivation

Construction of antibody repertoire is a preliminary step of any immunological analysis based on sequencing reads. Accurate construction of antibody repertoire avoids erroneous analysis of natural variations and antibody abundances.

Modern sequencing technologies (e.g., Illumina MiSeq) allow biologists to perform full-length scanning of antibody repertoire. For example, overlapping paired-end Illumina MiSeq reads cover variable regions of antibody:

Antibody sequencing

Problem of construction of antibody repertoire can be formulated as follows:

pipeline

Antibody repertoire construction tool takes as an input sequencing reads (result of RNA-sequencing of some B-cells) and constructs antibody repertoire as a set of antibody clusters. Each antibody cluster represent a group of identical antibodies and is characterized by its sequence and abundance.

Our approach

IgRepertoireConstructor constructs antibody repertoire from immunosequencing reads and uses mass spectra to validate the constructed antibody repertoires. The whole pipeline includes two parts:

  • IgReC performs construction of antibody repertoire from immunosequencing reads: it corrects sequencing and amplification errors and joins together reads corresponding to identical antibodies. Constructed repertoire can be used as a highly accurate input for various immunological applications, e.g., construction of clonal trees, V(D)J labeling / CDR3 classification, mass spectra identification etc. Details of IgReC algorithm can be found below.
  • Mass Spectra Analyzer is a tool for immunoproteogenomics analysis. Repertoire constructed by IgRepertoireConstructor can be converted into a database and used for mass spectra identifications (using some standard tool, e.g., MS-GF+). Mass Spectra Analyzer performs analysis of the computed matches and computes various statistics showing similarity of the constructed repertoire and mass spectra. Mass Spectra Analyzer details are described below.
pipeline
Pipeline of antibody repertoire analysis.

IgReC

IgReC takes as an input reads (paired-end or single) covering variable region of antibodies and corrects sequencing and amplification errors.

Algorithm performs the following steps:

  • VJ Finder: V, J labeling, cleaning and cropping input reads.
  • Construction & clustering of Hamming graph: representation of cleaned reads as a Hamming graph and finding of groups of highly similar antibodies.
  • Antibody construction: reconstruction of antibody sequences from the constructed groups, cleaning amplification errors.

VJ Finder

The main goals of VJ Finder tool are:

  1. Finding and discarding contaminated reads.
  2. Cropping remaining immunosequencing reads by the start position of the closest V gene segment and the end position of the closest J gene segment. VJ Finder also discards reads covering V gene segment only since they can not be unambiguously assigned to antibody cluster.

VJ Finder outputs information about the closest V and J gene segments in tab-separated view:

Read id V start V end V score
(% identity)
V id J start J end J score
(% identity)
J id
read1 1 296 100.0 IGHV3-20*01 321 366 89.0 IGHJ5*02
read2 1 294 98.64 IGHV3-9*01 309 354 100.0 IGHJ2*01

Construction and clustering of Hamming graph

We use Hamming graph to show similarity between reads. Vertices of Hamming graph represent reads remaining after VJ Finder, edge connects read1 and read2 if Hamming distance between these them is relatively small. After construction of graph, we use clustering techniques that allow us to find dense subgraphs (corrupted cliques) in the graph (see figure below). These dense subgraphs present groups of highly similar antibodies. Algorithm for search of corrupted cliques is also implemented as a separate tool and is described in Dense Subgraph Finder section.

hamming_graph_before_clusterization hamming_graph_after_clusterization
Example of connected component of Hamming graph constructed from immunosequencing reads (heavy chain human repertoire). Clustering of Hamming graph performed by IgReC. Each color corresponds to antibody cluster in repertoire.

id=”mass_spectra_analysis”>Mass Spectra Analyzer

This step performs immunoproteogenomics analysis to validate constructed repertoire using mass spectra. It takes alignment of mass spectra against constructed repertoire in mzIdentML 1.1 format(e.g., generated by MS-GF+) as an input and outputs a set of metrics and graphs that show how close are the constructed repertoire and provided mass spectra:

peptide_length_Total PSM_cov_Total
Histogram of peptide length distribution. Histogram of variable region coverage by PSMs. Grey bars correspond to expected positions of CDRs.

 

DSF (Dense Subgraph Finder)

Search for dense subgraphs (or corrupted cliques) is a very common problem arising in bioinformatics (e.g., co-expression of genes) and studies of social interactions (e.g., recommendation services).

We decided to make an algorithm for dense subgraphs search available as an independent tool. One can download IgRepertoireConstructor archive and run DSF for search of dense subgraphs. DSF takes as an input indirected graph in GRAPH format and outputs decomposition where the identical ids are assigned for vertices from the same dense subgraph. Density of constructed subgraph can be tuned via parameter -f (minimum edge fill-in in dense subgraphs). Table below shows how dense subgraph decomposition varies depending of minimal edge fill-in value.

dsf_example_edge_fill-in_0.3
dsf_example_edge_fill-in_0.4
Minimal edge fill-in = 0.3 Minimal edge fill-in = 0.4
# dense subgraphs = 55
# trivial subgraphs = 40
size of maximal subgraph = 157
# dense subgraphs = 61
# trivial subgraphs = 46
size of maximal subgraph = 104
dsf_example_edge_fill-in_0.5 dsf_example_edge_fill-in_0.7
Minimal edge fill-in = 0.5 Minimal edge fill-in = 0.7
# dense subgraphs = 60
# trivial subgraphs = 43
size of maximal subgraph = 87
# dense subgraphs = 68
# trivial subgraphs = 42
size of maximal subgraph = 43

Here is an example of DSF work for finding recommendations. Graph polbooks shows books about USA politics that were bought by the same customer on Amazon. Dense subgraphs in such graph present books that can be recommended together.

polbooks_fill-in_0.2
polbooks_fill-in_0.2
polbooks_fill-in_0.2
Minimal edge fill-in = 0.2 Minimal edge fill-in = 0.5 Minimal edge fill-in = 0.7
# dense subgraphs = 17
# trivial subgraphs = 8
size of maximal subgraph = 28
# dense subgraphs = 30
# trivial subgraphs = 11
size of maximal subgraph = 16
# dense subgraphs = 45
# trivial subgraphs = 16
size of maximal subgraph = 6

Manual

  • IgRepertoireConstructor manual can be found here.
  • DSF manual can be found here.

Publications

  1. Safonova Y. IgRepertoireConstructor: a novel algorithm for antibody repertoire construction and immunoproteogenomics analysis. Talk on ISMB/ECCB 2015. Dublin, Ireland., 2015