PathRacer is a tool for aligning an assembly graph against a profile HMM (pHMM). It reports the k most probable paths traversed by an HMM through the whole assembly graph. It supports both nucleotide and amino-acid pHMMs, performing nucleotide-to-amino-acid translation on the fly and walking through frameshifts.
PathRacer has two versions:
pathracer (the main one) aligns both nucleotide and amino-acid pHMMs against assembly graphs;
pathracer-seq-fs aligns amino-acid pHMMs against separate sequences, allowing indels in the nucleotide space.
pathracer is intended for complex metagenome assembly graphs, for assembling and annotating fragmented genes.
pathracer-seq-fs is intended as a replacement for the original HMMer on sequences with a high indel rate, e.g., PacBio/ONT contigs.
Both tools use an extended pHMM model that allows frameshifts.
For pathracer-seq-fs this extension is crucial: for aligning amino-acid pHMMs without allowing indels in the nucleotide space, six-frame translation plus hmmsearch from the HMMer package is more than enough.
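To make the six-frame baseline mentioned above concrete, here is a minimal Python sketch of six-frame translation with the standard genetic code. It is not part of PathRacer; the function names are illustrative assumptions. Each translated frame could then be written to a FASTA file and searched with hmmsearch. A single nucleotide indel moves the rest of the gene into another frame, which is the case the frameshift-aware model is meant to handle.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence (other symbols are kept)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate one reading frame; codons with unknown bases become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return {frame: protein}: frames +1..+3 forward, -1..-3 on the reverse strand."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(frame, protein)
```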
Currently the tool supports only de Bruijn graphs in GFA format as produced by SPAdes or an assembler compatible in this respect (e.g., MEGAHIT). Contact us if you need support for another format. Input sequences should be in FASTA/FASTQ format.
The profile HMM should be in HMMer3 format, but nucleotide or amino acid sequences can be passed as well. These sequences are converted to proxy pHMMs. Aligning such a pHMM is equivalent to aligning the corresponding input sequence using Levenshtein distance.
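For readers unfamiliar with the metric, the following Python sketch computes the Levenshtein (edit) distance between two sequences with the usual dynamic programming recurrence. It only illustrates the distance itself; how PathRacer builds and scores proxy pHMMs internally is not shown here, and the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimal number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("ACGTTGCA", "ACGTGCA"))
```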
pathracer command line options
Required positional arguments:
-o DIR: output directory
--nt | --aa: treat the queries as nucleotide sequences (--nt) or amino acid sequences (--aa) instead of pHMM(s) [default]
--queries Q1 [Q2 [...]]: names of the queries to look up [default: all queries from the input query file]
--edges E1 [E2 [...]]: match around particular edges [default: all graph edges]
--global | --local: perform HMM-global, graph-local (aka glocal, default) or HMM-local, graph-local HMM matching
-l L: minimal length of the resulting matched sequence; if ≤ 1, it is multiplied by the aligned HMM length [default: 0.9]
-r RATE: expected rate of nucleotide indels in graph edges [default: 0]. Used for aligning AA pHMMs with frameshifts
--top N: extract up to N top scored paths [default: 10000]; only unique paths are reported
--rescore: rescore resulting paths by HMMer and produce output tables in HMMer standard formats
-t T: the total number of CPU threads to use [default: 16]
--parallel-components: process connected components of neighborhood subgraph in parallel
-m M: RAM limit in GB (PathRacer terminates if the limit is exceeded) [default: 100]
--annotate-graph: emit paths in GFA graph
--max-size MAX_SIZE: maximal component size to consider [default: INF]
--max-insertion-length: maximal allowed number of successive I-emissions [default: 30]
--no-top-score-filter: disable top score Event Graph vertices filter. Increases sensitivity of deep analysis
Debug output control:
--debug: enable extensive debug console output
--draw: draw pictures around the interesting edges
--export-event-graph: export Event Graph in .cereal format
In addition, there are some developer options that are not meant to be tuned by end users. They may be removed in future releases.
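The rule for -l described above (a value of at most 1 is treated as a fraction of the aligned HMM length, a larger value as an absolute length) can be sketched in Python as follows. This is an illustration of the documented rule only: the function name is hypothetical, and whether PathRacer rounds the product is not stated in this document, so no rounding is applied.

```python
def min_match_length(l_value, aligned_hmm_length):
    """Resolve the -l option into an absolute minimal match length.

    Values <= 1 are fractions of the aligned HMM length; larger values are
    taken as absolute lengths. Rounding behaviour is an open assumption.
    """
    if l_value <= 1:
        return l_value * aligned_hmm_length
    return l_value


if __name__ == "__main__":
    print(min_match_length(0.9, 300))  # default fraction applied to a 300-state HMM
    print(min_match_length(150, 300))  # absolute length, independent of the HMM
```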
For each input pHMM (gene model)
pathracer-seq-fs command line options
Required positional arguments:
--global | --local, --memory: the same as in the main pathracer
--sequences S1 [S2 [...]]: sequence IDs to process [default: all input sequences]
-r RATE: expected rate of nucleotide indels in graph edges [default: 0.05]. Used for aligning AA pHMMs with frameshifts
--max-fs N: maximal allowed number of frameshifts in a reported sequence [default: 10]
--cutoff CUTOFF: bitscore cutoff for a reported match; if ≤ 1, it is multiplied by the GA cutoff of the HMM [default: 0.7]
The same as in main
For each input pHMM (gene model): <gene_name>.seqs.fa and <gene_name>.nucls.fa, the same as in main
<gene_name>.seqs.fa and <gene_name>.nucls.fa files contain metainformation in FASTA headers.
For pathracer the header format is:
>Score=PathRacer score|Edges=edges path|Position=starting position on the first edge|Alignment=CIGAR alignment
Prime (') after an edge ID means reverse complement
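A header in this layout can be split into its fields with a few lines of Python. The sketch below assumes that fields are separated by "|" and that the first "=" in each field separates the key from the value; the sample header and its values are made up for illustration, and the function name is hypothetical.

```python
def parse_header(header):
    """Split a 'Key=value|Key=value|...' FASTA header into a dict of strings."""
    fields = {}
    for item in header.lstrip(">").strip().split("|"):
        # Split on the first '=' only, since a CIGAR string may itself contain '='.
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


if __name__ == "__main__":
    # Hypothetical header with a single reverse-complement edge.
    sample = ">Score=123.4|Edges=17'|Position=56|Alignment=100M"
    record = parse_header(sample)
    print(record["Score"], record["Edges"], record["Position"], record["Alignment"])
    print("reverse complement:", record["Edges"].endswith("'"))
```

All values are kept as strings; converting Score or Position to numbers is left to the caller.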
For pathracer-seq-fs the header format is:
>Score=PathRacer score|Bitscore=HMMer bitscore for the whole sequence without incomplete codons|PartialBitscore=Maximal HMMer bitscore for fragment between frameshifts|Seq=Sequence ID|Position=Starting position in the sequence|Frameshifts=#Frameshifts|Alignment=CIGAR alignment
For alignments with frameshifts the extended CIGAR/FASTA notation is used: P/"-" denotes a one-nucleotide insertion and G/"=" a two-nucleotide insertion
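As a rough illustration, the Python sketch below tallies the operations of a CIGAR string, including the extended P and G operations. It assumes the usual run-length form (a number followed by an operation letter), which this document does not spell out; the sample string is invented and the function name is hypothetical.

```python
import re
from collections import Counter

# One CIGAR element: a run length followed by a one-character operation.
CIGAR_ELEMENT = re.compile(r"(\d+)([A-Za-z=])")


def cigar_op_totals(cigar):
    """Sum the run lengths of each operation in a CIGAR string."""
    totals = Counter()
    for length, op in CIGAR_ELEMENT.findall(cigar):
        totals[op] += int(length)
    return totals


if __name__ == "__main__":
    totals = cigar_op_totals("10M1P5M1G3M")  # made-up alignment
    print("one-nucleotide insertions (P):", totals["P"])
    print("two-nucleotide insertions (G):", totals["G"])
```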
Example datasets can be downloaded from http://cab.spbu.ru/software/pathracer/
Search for beta-lactamase genes (amino acid pHMMs) in Singapore wastewater
pathracer bla_all.hmm urban_strain.gfa 55 --output pathracer_urban_strain_bla_all
Search for beta-lactamase genes (amino acid pHMMs) in AMR ONT plasmids (many indels!)
pathracer-seq-fs bla_all.hmm plasmids-ONT.fa --output pathracer_plasmids_ont_bla_all
Search for 16S/5S/23S (nucleotide HMMs) in the E. coli multicell assembly
pathracer bac.hmm ecoli_mc.gfa 55 --output pathracer_ecoli_mc_bac
Search for known 16S sequences in the E. coli multicell assembly
pathracer synth16S_new.fa ecoli_mc.gfa 55 --nt --output pathracer_ecoli_mc_16S_seqs
Search for known 16S sequences in the SYNTH mock metagenome assembly
pathracer synth16S_new.fa synth_strain_gbuilder.gfa 55 --nt --output pathracer_synth_strain_gbuider_16S_seqs
Let us extract all 16S sequences from the SYNTH mock metagenome assembly.
For this we increase --top and disable the Event Graph vertices filter.
Deep analysis of an extremely complicated dataset also requires tuning the stack and memory limits:
ulimit -s unlimited &&
pathracer bac.hmm synth_strain_gbuilder.gfa 55 --queries 16S_rRNA -m 250 --top 1000000 --output pathracer_synth_strain_gbuilder_16s --no-top-score-filter
If you are using PathRacer in your research, please cite:
A. Shlemov and A. Korobeynikov. PathRacer: racing profile HMM paths on assembly graph. In Proceedings of International Conference on Algorithms for Computational Biology, AlCoB 2019. Berkeley, California, USA, May 28–30, 2019, volume 11488 LNCS, pages 80–94, 2019.
In case of any problems running PathRacer please contact SPAdes support firstname.lastname@example.org attaching the log file. Your suggestions are also very welcome!