ProphAsm is a tool for computing simplitigs from k-mer sets. Simplitigs are strings obtained as disjoint paths in a bidirectional vertex-centric de Bruijn graph. Compared to unitigs, simplitigs reduce both the number of sequences and their cumulative length while representing exactly the same k-mers. For more details, see the paper.
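As a small illustration of the two measures mentioned above, the following Python sketch compares two hand-made representations of the same three k-mers (k = 3). The helper names are hypothetical and not part of ProphAsm, and identifying a k-mer with its reverse complement via the lexicographic minimum is an assumption made for this example.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


def canonical_kmers(sequences, k):
    """Set of k-mers in the sequences, each identified with its reverse complement."""
    return {
        min(seq[i:i + k], reverse_complement(seq[i:i + k]))
        for seq in sequences
        for i in range(len(seq) - k + 1)
    }


def representation_stats(sequences):
    """Return (number of sequences, cumulative length)."""
    return len(sequences), sum(len(s) for s in sequences)


k = 3
one_string_per_kmer = ["ACC", "CCG", "CGA"]  # hypothetical toy input
single_simplitig = ["ACCGA"]                 # the same k-mers merged into one string

same_kmers = canonical_kmers(one_string_per_kmer, k) == canonical_kmers(single_simplitig, k)
stats_separate = representation_stats(one_string_per_kmer)
stats_merged = representation_stats(single_simplitig)
```

When every k-mer occurs exactly once, the cumulative length equals the number of k-mers plus (k - 1) times the number of sequences, so fewer sequences directly means fewer characters.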
Various types of sequencing datasets can be used as the input for ProphAsm, including genomes, pan-genomes, metagenomes or sequencing reads. Besides computing simplitigs, ProphAsm can also compute intersection and set differences of k-mer sets (while set unions are easy to compute simply by merging the source files).
Upon execution, ProphAsm first loads all specified datasets (see the -i parameter) and builds the corresponding k-mer sets (see the -k parameter). If the -x parameter is provided, ProphAsm then computes their intersection, subtracts the intersection from the individual k-mer sets, and computes simplitigs for the intersection. If output files are specified (see the -o parameter), it also computes the set differences.
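The set operations described above can be sketched on plain Python sets. This is only an illustration of the logic, not ProphAsm's implementation: the function names are hypothetical, and representing each k-mer by the lexicographic minimum of itself and its reverse complement is an assumption of the sketch.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


def canonical_kmers(sequences, k):
    """Collect the canonical k-mers of all sequences into one set."""
    kmers = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def intersect_and_subtract(kmer_sets):
    """Return the common k-mers and, per input set, the k-mers left after removing them."""
    intersection = set.intersection(*kmer_sets)
    differences = [s - intersection for s in kmer_sets]
    return intersection, differences


# Hypothetical toy datasets with k = 3.
set1 = canonical_kmers(["ACCGA"], 3)
set2 = canonical_kmers(["CCGAT"], 3)
intersection, (only1, only2) = intersect_and_subtract([set1, set2])
```

In this toy case the two datasets share the k-mers CCG and CGA, while ACC remains specific to the first dataset and ATC (the canonical form of GAT) to the second.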
To cite the concept of simplitigs and ProphAsm as a tool, please use the following reference:
Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biology 22(96), 2021; doi: https://doi.org/10.1186/s13059-021-02297-z
@article{brinda2021-simplitigs,
title = { Simplitigs as an efficient and scalable representation of de {Bruijn} graphs },
author = { Karel B{\v r}inda and Michael Baym and Gregory Kucherov },
journal = { Genome Biology },
volume = { 22 },
number = { 96 },
year = { 2021 },
doi = { 10.1186/s13059-021-02297-z }
}
For the concept of simplitigs, you might also consider citing the following paper from another group, introducing independently the same concept under the name spectrum-preserving string sets (SPSS):
Rahman A and Medvedev P. Representation of k-mer sets using spectrum-preserving string sets. Journal of Computational Biology 28(4), pp. 381-394, 2021. https://doi.org/10.1089/cmb.2020.0431
- GCC 4.8+ or equivalent
- ZLib
Download and compile ProphAsm:
git clone https://github.com/prophyle/prophasm
cd prophasm && make -j
Compute simplitigs:
./prophasm -k 15 -i tests/test1.fa -o simplitigs.fa
Set operations:
./prophasm -k 15 -i tests/test1.fa -i tests/test2.fa -o _out1.fa -o _out2.fa -x _intersect.fa -s _stats.tsv
Program: prophasm (a greedy assembler for k-mer set compression)
Version: 0.1.1
Contact: Karel Brinda <kbrinda@hsph.harvard.edu>
Usage: prophasm [options]
Examples: prophasm -k 15 -i f1.fa -i f2.fa -x fx.fa
- compute intersection of f1 and f2
prophasm -k 15 -i f1.fa -i f2.fa -x fx.fa -o g1.fa -o g2.fa
- compute intersection of f1 and f2, and subtract it from them
prophasm -k 15 -i f1.fa -o g1.fa
- re-assemble f1 to g1
Command-line parameters:
-k INT K-mer size.
-i FILE Input FASTA file (can be used multiple times).
-o FILE Output FASTA file (if used, must be used as many times as -i).
-x FILE Compute intersection, subtract it, save it.
-s FILE Output file with k-mer statistics.
-S Silent mode.
Note that '-' can be used for standard input/output.
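When ProphAsm is called from a script, the argument list can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of ProphAsm; it uses only the parameters listed above and enforces the rule that -o, if used, must be given as many times as -i.

```python
def build_prophasm_command(k, inputs, outputs=None, intersection=None,
                           stats=None, silent=False, executable="prophasm"):
    """Assemble an argument list for ProphAsm (hypothetical helper)."""
    if outputs and len(outputs) != len(inputs):
        raise ValueError("-o must be used as many times as -i")
    cmd = [executable, "-k", str(k)]
    for path in inputs:
        cmd += ["-i", path]
    for path in outputs or []:
        cmd += ["-o", path]
    if intersection is not None:
        cmd += ["-x", intersection]
    if stats is not None:
        cmd += ["-s", stats]
    if silent:
        cmd.append("-S")
    return cmd


cmd = build_prophasm_command(
    15,
    ["tests/test1.fa", "tests/test2.fa"],
    outputs=["_out1.fa", "_out2.fa"],
    intersection="_intersect.fa",
    stats="_stats.tsv",
    executable="./prophasm",
)
# The list can then be passed to subprocess.run(cmd, check=True).
```

Building a list rather than a shell string avoids quoting problems with file names that contain spaces.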
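def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))

def extend_simplitig_forward(K, simplitig, k):
    """Greedily extend the simplitig to the right while a matching k-mer remains in K."""
    extending = True
    while extending:
        extending = False
        q = simplitig[len(simplitig) - k + 1:]  # last k-1 characters
        for x in "ACGT":
            kmer = q + x
            if kmer in K:
                extending = True
                simplitig = simplitig + x
                K.remove(kmer)
                K.discard(reverse_complement(kmer))  # discard: kmer may be its own reverse complement
                break
    return K, simplitig

def get_maximal_simplitig(K, initial_kmer):
    """Build a maximal simplitig around initial_kmer by extending in both directions."""
    k = len(initial_kmer)
    simplitig = initial_kmer
    K.remove(initial_kmer)
    K.discard(reverse_complement(initial_kmer))
    K, simplitig = extend_simplitig_forward(K, simplitig, k)
    simplitig = reverse_complement(simplitig)
    K, simplitig = extend_simplitig_forward(K, simplitig, k)
    return K, simplitig

def compute_simplitigs(kmers):
    """Greedily compute simplitigs for an iterable of k-mers of equal length."""
    K = set()
    for kmer in kmers:
        K.add(kmer)
        K.add(reverse_complement(kmer))
    simplitigs = set()
    while len(K) > 0:
        initial_kmer = next(iter(K))  # an arbitrary remaining k-mer
        K, simplitig = get_maximal_simplitig(K, initial_kmer)
        simplitigs.add(simplitig)
    return simplitigs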
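The output of such a greedy procedure can be sanity-checked independently of how it was produced. The sketch below tests the two properties implied by the definition of simplitigs: the strings contain exactly the input k-mers, and no k-mer (up to reverse complement) occurs more than once. The function name `check_simplitigs` is hypothetical and the check is not part of ProphAsm.

```python
from collections import Counter


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


def canonical(kmer):
    """Identify a k-mer with its reverse complement via the lexicographic minimum."""
    return min(kmer, reverse_complement(kmer))


def check_simplitigs(simplitigs, kmers, k):
    """True if the strings contain every input k-mer exactly once and nothing else."""
    counts = Counter(
        canonical(s[i:i + k])
        for s in simplitigs
        for i in range(len(s) - k + 1)
    )
    expected = {canonical(kmer) for kmer in kmers}
    covers_input = set(counts) == expected
    no_repeats = all(c == 1 for c in counts.values())
    return covers_input and no_repeats


kmers = ["ACC", "CCG", "CGA"]                       # hypothetical toy input, k = 3
ok = check_simplitigs(["ACCGA"], kmers, 3)          # one path through all three k-mers
ok_rc = check_simplitigs(["TCGGT"], kmers, 3)       # the same path on the opposite strand
repeated = check_simplitigs(["ACCG", "CCGA"], kmers, 3)  # CCG appears in both strings
missing = check_simplitigs(["ACCG"], kmers, 3)      # CGA is not represented
```

Because orientation is ignored, a simplitig and its reverse complement are treated as equivalent representations of the same k-mers.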
- Sneak peek at the -tigs! - An overview of different *tigs in computational biology.
- UST - Another tool for computing simplitigs. Unlike ProphAsm, UST requires pre-computed unitigs as the input, therefore the method is overall more resource-demanding.
- BCalm 2 - The best available tool for computing unitigs.
- Unikmer - Another tool for k-mer set operations.
Please use GitHub issues.
See Releases.
Karel Brinda <karel.brinda@hms.harvard.edu>