PathRacer: racing profile HMM paths on assembly graph

MANUAL

Overview

PathRacer is a tool for aligning profile HMMs (pHMMs) against assembly graphs. It reports the k most probable paths of a pHMM through the whole assembly graph. It supports both nucleotide and amino-acid pHMMs, translating nucleotides to amino acids on the fly and walking through frameshifts.

PathRacer has two versions: the main pathracer aligns both nucleotide and amino-acid pHMMs against assembly graphs, and pathracer-seq-fs aligns amino-acid pHMMs against separate sequences, allowing indels in the nucleotide space. pathracer is intended for complex metagenome assembly graphs, where it helps assemble and annotate fragmented genes. pathracer-seq-fs is intended as a replacement for the original HMMer on sequences with a high indel rate, e.g., PacBio/ONT contigs.

Both tools use an extended pHMM model that allows frameshifts:

Scheme of extended pHMM

For pathracer-seq-fs, however, this extension is crucial: for aligning amino-acid pHMMs without allowing indels in the nucleotide space, six-frame translation followed by hmmsearch from the HMMer package is sufficient.
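To make the six-frame baseline concrete, here is a minimal Python sketch of six-frame translation. It is not part of PathRacer or HMMer; the function names are hypothetical and the standard genetic code is assumed. Each frame is translated independently, so a single-nucleotide indel moves the rest of the gene into a different frame, which is the case the extended pHMM is meant to handle.

```python
from itertools import product

# Standard genetic code (assumption: NCBI translation table 1),
# with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq: str) -> dict:
    """Map frame labels (+1..+3, -1..-3) to translated proteins."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

The translated frames could then be written to a FASTA file and searched with hmmsearch; that step is omitted here.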

Input

Currently the tool supports only de Bruijn graphs in GFA format, as produced by SPAdes or by an assembler that is compatible in this respect (e.g., MEGAHIT). Contact us if you need support for another format. Input sequences should be in FASTA/FASTQ format.

Profile HMMs should be in HMMer3 format, but one can pass nucleotide or amino-acid sequences as well. These sequences are converted to proxy pHMMs. Aligning such a proxy pHMM is equivalent to aligning the corresponding input sequence using Levenshtein distance.
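As a reminder of what Levenshtein distance measures, the following is a minimal Python sketch of the classic dynamic-programming computation with unit costs. It only illustrates the distance itself; it is not how PathRacer builds or aligns proxy pHMMs, and the function name is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for substitution, insertion and deletion."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if char_a == char_b else 1)
            deletion = prev[j] + 1       # drop char_a
            insertion = curr[j - 1] + 1  # add char_b
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

For example, `levenshtein("ACGT", "AGT")` is 1, since a single deletion turns one sequence into the other.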

pathracer command line options

Required positional arguments:

  1. Query file (.hmm file or .fasta)
  2. Assembly graph in GFA format
  3. k (de Bruijn vertex overlap size) for the input graph
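If the value of k for a graph is not known, it can often be read from the graph itself. The Python sketch below assumes that the GFA file stores link overlaps in the sixth column of `L` lines as a simple CIGAR such as `55M`, as in GFA 1; whether a particular assembler writes them this way is an assumption to verify on your own file. The function name is hypothetical.

```python
import re
from collections import Counter

OVERLAP = re.compile(r"(\d+)M")


def link_overlap_sizes(gfa_lines) -> Counter:
    """Count the overlap sizes found on GFA link (L) lines."""
    sizes = Counter()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # GFA 1 link line: L, from, from_orient, to, to_orient, overlap
        if fields[0] == "L" and len(fields) >= 6:
            match = OVERLAP.fullmatch(fields[5])
            if match:
                sizes[int(match.group(1))] += 1
    return sizes
```

On a de Bruijn graph one would expect a single overlap size; that value is the candidate for the third positional argument. With a file, pass an open handle, e.g. `link_overlap_sizes(open("graph.gfa"))`, where `graph.gfa` is a placeholder name.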

Main options:

Heuristics options:

Debug output control:

In addition, there are some developer options that are not meant to be tuned by the end user. They may be removed in future releases.

pathracer output

For each input pHMM (gene model) pathracer reports:

In addition:

pathracer-seq-fs command line options

Required positional arguments:

  1. Query .hmm file (.fasta is not supported yet)
  2. Sequence file (.fasta or .fastq)

Main options:

Heuristics options: The same as in main pathracer

pathracer-seq-fs output

For each input pHMM (gene model): <gene_name>.seqs.fa and <gene_name>.nucls.fa, the same as in main pathracer

Output files format

<gene_name>.seqs.fa and <gene_name>.nucls.fa files contain metainformation in FASTA headers. For main pathracer the header format is:

>Score=PathRacer score|Edges=edges path|Position=starting position on the first edge|Alignment=CIGAR alignment

E.g.:

>Score=366.239|Edges=255162_24353'|Position=9210|Alignment=186M2D186M

A prime (') after an edge ID denotes the reverse complement of that edge.
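For downstream scripting, a header of this form can be split into its fields. The Python sketch below is an illustration only: the `PathHit` type and the function name are hypothetical, and it assumes that edge IDs in the path are joined by underscores, as the single example above suggests.

```python
from typing import List, NamedTuple, Tuple


class PathHit(NamedTuple):
    score: float
    edges: List[Tuple[str, bool]]  # (edge ID, is reverse complement)
    position: int
    alignment: str


def parse_pathracer_header(header: str) -> PathHit:
    """Parse a main-pathracer FASTA header into typed fields."""
    parts = header.strip().lstrip(">").split("|")
    fields = dict(part.split("=", 1) for part in parts)
    edges = []
    # Assumption: edges in the path are separated by "_".
    for edge in fields["Edges"].split("_"):
        is_rc = edge.endswith("'")  # a trailing prime marks reverse complement
        edges.append((edge.rstrip("'"), is_rc))
    return PathHit(
        score=float(fields["Score"]),
        edges=edges,
        position=int(fields["Position"]),
        alignment=fields["Alignment"],
    )
```

Applied to the example header, this yields a score of 366.239, the edge path 255162 followed by the reverse complement of 24353, and the starting position 9210.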

For pathracer-seq-fs the header format is:

>Score=PathRacer score|Bitscore=HMMer bitscore for the whole sequence without incomplete codons|PartialBitscore=Maximal HMMer bitscore for fragment between frameshifts|Seq=Sequence ID|Position=Starting position in the sequence|Frameshifts=#Frameshifts|Alignment=CIGAR alignment

E.g.

>Score=342.689|Bitscore=539.274|PartialBitscore=238.41|Seq=RB12-N|Position=2935|Alignment=55M1G1M1D20M1D14M1I3M2D11M1P1M1D64M1D62M1D1M1G23M1D30M
MSLYRRLVLLSCLSWPLAGFSATALTNLVAEPFAKLEQDFGGSIGVYAMDTGSGA=CSYR
AEERFPLCSSFKGFLAAVLARSQQGRLAGHTHPLRQNALVPWSPIS-KYLTTGMTVAELS
AAAVQYSDNAAANLLLKELGGPAGLTAFMRSIGDTTFRLDRWELELNSAIRAMRAIPHRR
ARDGKLTKLTLGSALAAPQRQQFVDWLKGNTTGNHRIRAAVPADWAVGDKTGTCG=YGTA
NDYAVVWPTGRAPIVLAVYRAPNKDDKHSEAVIAAAARLALEDWASTAV

For alignments with frameshifts, an extended CIGAR/FASTA notation is used: P/"-" denotes a one-nucleotide insertion and G/"=" denotes a two-nucleotide insertion.
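The extended CIGAR can be tokenized like an ordinary one, with P and G as two extra operations. The Python sketch below is a hypothetical helper, not part of PathRacer: it assumes the operation alphabet is limited to M, I, D, P and G (the letters that appear in this manual) and counts the frameshift-causing insertions.

```python
import re
from collections import Counter

# Assumption: only these operation letters occur in PathRacer alignments.
CIGAR_OP = re.compile(r"(\d+)([MIDPG])")


def parse_extended_cigar(cigar: str):
    """Split an extended CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings with characters the pattern did not consume.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"unexpected CIGAR string: {cigar!r}")
    return ops


def frameshift_summary(cigar: str) -> dict:
    """Count one-nucleotide (P) and two-nucleotide (G) insertions."""
    totals = Counter()
    for length, op in parse_extended_cigar(cigar):
        totals[op] += length
    return {
        "one_nt_insertions": totals["P"],
        "two_nt_insertions": totals["G"],
    }
```

For the example alignment above, this reports one one-nucleotide insertion and two two-nucleotide insertions, matching the single "-" and the two "=" symbols in the sequence.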

Examples

Example datasets can be downloaded from http://cab.spbu.ru/software/pathracer/

Search for beta-lactamase genes (amino-acid pHMMs) in Singapore wastewater
pathracer bla_all.hmm urban_strain.gfa 55 --output pathracer_urban_strain_bla_all

Search for beta-lactamase genes (amino-acid pHMMs) in AMR ONT plasmids (many indels!)
pathracer-seq-fs bla_all.hmm plasmids-ONT.fa --output pathracer_plasmids_ont_bla_all

Search for 16S/5S/23S (nucleotide HMMs) in E. coli multicell assembly
pathracer bac.hmm ecoli_mc.gfa 55 --output pathracer_ecoli_mc_bac

Search for known 16S sequences in E. coli multicell assembly
pathracer synth16S_new.fa ecoli_mc.gfa 55 --nt --output pathracer_ecoli_mc_16S_seqs

Search for known 16S sequences in SYNTH mock metagenome assembly
pathracer synth16S_new.fa synth_strain_gbuilder.gfa 55 --nt --output pathracer_synth_strain_gbuider_16S_seqs

Let us extract all 16S sequences from the SYNTH mock metagenome assembly. For this we increase --top and disable the Event Graph vertices filter (--no-top-score-filter). Deep analysis of an extremely complicated dataset also requires tuning the stack and memory limits:
ulimit -s unlimited &&
export OMP_STACKSIZE=1G
pathracer bac.hmm synth_strain_gbuilder.gfa 55 --queries 16S_rRNA -m 250 --top 1000000 --output pathracer_synth_strain_gbuilder_16s --no-top-score-filter

References

If you are using PathRacer in your research, please cite:
A. Shlemov and A. Korobeynikov. PathRacer: racing profile HMM paths on assembly graph. In Proceedings of International Conference on Algorithms for Computational Biology, AlCoB 2019. Berkeley, California, USA, May 28–30, 2019, volume 11488 LNCS, pages 80–94, 2019.
https://link.springer.com/chapter/10.1007/978-3-030-18174-1_6

In case of any problems running PathRacer, please contact SPAdes support at spades.support@cab.spbu.ru and attach the log file. Your suggestions are also very welcome!