Emerging linked-read technologies (aka Read-Cloud or barcoded short-reads) have revived interest in short-read technology as a viable way to understand large-scale structure in genomes and metagenomes. Linked-read technologies, such as the 10x Chromium system, use a microfluidic system and a specialized set of barcodes to tag short DNA reads sourced from the same long fragment of DNA. Subsequently, the tagged reads are sequenced on standard short read platforms.
This approach results in interesting compromises. Each long fragment of DNA is only sparsely covered by reads, no information about the ordering of reads from the same fragment is preserved, and barcodes match reads from roughly 2-20 long fragments of DNA. However, compared to long read technologies the cost per base to sequence is far lower, far less input DNA is required, and the per base error rate is that of Illumina short-reads.
In the accompanying paper, we formally describe a particular algorithmic issue for linked-read technology: the deconvolution of reads with a single barcode into clusters that represent single long fragments of DNA. We also present Minerva, an algorithm which approximately solves the barcode deconvolution problem for metagenomic data. This codebase implements Minerva.
Minerva: An Alignment and Reference Free Approach to Deconvole Linked-Reads for Metagenomics
From PyPi
pip install minerva_deconvolve
From source
git clone <url>
cd minerva_barcode_deconvolution
python setup.py install
Use the following command to run barcode deconvolution. <fastq>
should be an interleaved fastq file where reads have a BX
tag designating barcode (this is the default output of longranger basic)
cat <fastq> | minerva_deconvolve -k 20 -w 40 -d 8 -a 20 --remove-stopwords --eps 0.51 > ebc_assignments.tsv
For more options
minerva_deconvolve --help
Minerva assigns barcoded reads to clusters within each barcode called deconvolved barcodes. The minerva_deconvolve
command outputs a tsv file with three columns: read id, barcode, and cluster id. The deconvolved barcode for a read is a tuple of (barcode, cluster id).
$ head <minerva_output_file>
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1207:20627:25951 0
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1113:11082:83578 0
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:2206:4393:100450 0
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:2:1111:2014:28730 1
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:2216:17277:16384 2
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1201:19163:82220 2
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1202:16780:78102 0
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1210:7460:13722 2
To add enhanced barcodes to your fastq file run the following
minerva_deconvolve_fastq <bc_assignment_file> <fastq_file> - > output.fq
This is a demonstration program and is not intended to be performant. Runtimes over 10 hours are common even on small datasets. RAM usage is typically 50-100Gb.
The datasets used in the paper may be downloaded from AWS.
This algorithm was devloped and tested with help from Dmitrii Meleshko, Daniela Bezdan, Chris Mason, and Iman Hajirasouliha.
This package is written and maintained by David C. Danko