IgRepertoireConstructor
Content
Motivation
Construction of antibody repertoire is a preliminary step of any immunological analysis based on sequencing reads. Accurate construction of antibody repertoire avoids erroneous analysis of natural variations and antibody abundances.
Modern sequencing technologies (e.g., Illumina MiSeq) allow biologists to perform fulllength scanning of antibody repertoire. For example, overlapping pairedend Illumina MiSeq reads cover variable regions of antibody:
Problem of construction of antibody repertoire can be formulated as follows:
Antibody repertoire construction tool takes as an input sequencing reads (result of RNAsequencing of some Bcells) and constructs antibody repertoire as a set of antibody clusters. Each antibody cluster represent a group of identical antibodies and is characterized by its sequence and abundance.
Our approach
IgRepertoireConstructor constructs antibody repertoire from immunosequencing reads and uses mass spectra to validate the constructed antibody repertoires. The whole pipeline includes two parts:
 IgReC performs construction of antibody repertoire from immunosequencing reads: it corrects sequencing and amplification errors and joins together reads corresponding to identical antibodies. Constructed repertoire can be used as a highly accurate input for various immunological applications, e.g., construction of clonal trees, V(D)J labeling / CDR3 classification, mass spectra identification etc. Details of IgReC algorithm can be found below.
 Mass Spectra Analyzer is a tool for immunoproteogenomics analysis. Repertoire constructed by IgRepertoireConstructor can be converted into a database and used for mass spectra identifications (using some standard tool, e.g., MSGF+). Mass Spectra Analyzer performs analysis of the computed matches and computes various statistics showing similarity of the constructed repertoire and mass spectra. Mass Spectra Analyzer details are described below.
Pipeline of antibody repertoire analysis. 
IgReC
IgReC takes as an input reads (pairedend or single) covering variable region of antibodies and corrects sequencing and amplification errors.
Algorithm performs the following steps:
 VJ Finder: V, J labeling, cleaning and cropping input reads.
 Construction & clustering of Hamming graph: representation of cleaned reads as a Hamming graph and finding of groups of highly similar antibodies.
 Antibody construction: reconstruction of antibody sequences from the constructed groups, cleaning amplification errors.
VJ Finder
The main goals of VJ Finder tool are:
 Finding and discarding contaminated reads.
 Cropping remaining immunosequencing reads by the start position of the closest V gene segment and the end position of the closest J gene segment. VJ Finder also discards reads covering V gene segment only since they can not be unambiguously assigned to antibody cluster.
VJ Finder outputs information about the closest V and J gene segments in tabseparated view:
Read id  V start  V end  V score (% identity) 
V id  J start  J end  J score (% identity) 
J id 
read1  1  296  100.0  IGHV320*01  321  366  89.0  IGHJ5*02 
read2  1  294  98.64  IGHV39*01  309  354  100.0  IGHJ2*01 
…  …  …  …  …  …  …  …  … 
Construction and clustering of Hamming graph
We use Hamming graph to show similarity between reads. Vertices of Hamming graph represent reads remaining after VJ Finder, edge connects read1 and read2 if Hamming distance between these them is relatively small. After construction of graph, we use clustering techniques that allow us to find dense subgraphs (corrupted cliques) in the graph (see figure below). These dense subgraphs present groups of highly similar antibodies. Algorithm for search of corrupted cliques is also implemented as a separate tool and is described in Dense Subgraph Finder section.
Example of connected component of Hamming graph constructed from immunosequencing reads (heavy chain human repertoire).  Clustering of Hamming graph performed by IgReC. Each color corresponds to antibody cluster in repertoire. 
id=”mass_spectra_analysis”>Mass Spectra Analyzer
This step performs immunoproteogenomics analysis to validate constructed repertoire using mass spectra. It takes alignment of mass spectra against constructed repertoire in mzIdentML 1.1 format(e.g., generated by MSGF+) as an input and outputs a set of metrics and graphs that show how close are the constructed repertoire and provided mass spectra:
Histogram of peptide length distribution.  Histogram of variable region coverage by PSMs. Grey bars correspond to expected positions of CDRs. 
DSF (Dense Subgraph Finder)
Search for dense subgraphs (or corrupted cliques) is a very common problem arising in bioinformatics (e.g., coexpression of genes) and studies of social interactions (e.g., recommendation services).
We decided to make an algorithm for dense subgraphs search available as an independent tool. One can download IgRepertoireConstructor archive and run DSF for search of dense subgraphs. DSF takes as an input indirected graph in GRAPH format and outputs decomposition where the identical ids are assigned for vertices from the same dense subgraph. Density of constructed subgraph can be tuned via parameter f
(minimum edge fillin in dense subgraphs). Table below shows how dense subgraph decomposition varies depending of minimal edge fillin value.
Minimal edge fillin = 0.3  Minimal edge fillin = 0.4 
# dense subgraphs = 55 # trivial subgraphs = 40 size of maximal subgraph = 157 
# dense subgraphs = 61 # trivial subgraphs = 46 size of maximal subgraph = 104 
Minimal edge fillin = 0.5  Minimal edge fillin = 0.7 
# dense subgraphs = 60 # trivial subgraphs = 43 size of maximal subgraph = 87 
# dense subgraphs = 68 # trivial subgraphs = 42 size of maximal subgraph = 43 
Here is an example of DSF work for finding recommendations. Graph polbooks shows books about USA politics that were bought by the same customer on Amazon. Dense subgraphs in such graph present books that can be recommended together.
Minimal edge fillin = 0.2  Minimal edge fillin = 0.5  Minimal edge fillin = 0.7 
# dense subgraphs = 17 # trivial subgraphs = 8 size of maximal subgraph = 28 
# dense subgraphs = 30 # trivial subgraphs = 11 size of maximal subgraph = 16 
# dense subgraphs = 45 # trivial subgraphs = 16 size of maximal subgraph = 6 
Manual
Publications

IgRepertoireConstructor: a novel algorithm for antibody repertoire construction and immunoproteogenomics analysis. Bioinformatics, 2015 
IgRepertoireConstructor: a novel algorithm for antibody repertoire construction and immunoproteogenomics analysis. Talk on ISMB/ECCB 2015. Dublin, Ireland., 2015