lauren-mak/minerva_barcode_deconvolution

 
 

Minerva Barcoded Read Deconvolution

Emerging linked-read technologies (aka Read-Cloud or barcoded short-reads) have revived interest in short-read technology as a viable way to understand large-scale structure in genomes and metagenomes. Linked-read technologies, such as the 10x Chromium system, use a microfluidic system and a specialized set of barcodes to tag short DNA reads sourced from the same long fragment of DNA. Subsequently, the tagged reads are sequenced on standard short read platforms.

This approach results in interesting compromises. Each long fragment of DNA is only sparsely covered by reads, no information about the ordering of reads from the same fragment is preserved, and barcodes match reads from roughly 2-20 long fragments of DNA. However, compared to long read technologies the cost per base to sequence is far lower, far less input DNA is required, and the per base error rate is that of Illumina short-reads.

In the accompanying paper, we formally describe a particular algorithmic issue for linked-read technology: the deconvolution of reads with a single barcode into clusters that represent single long fragments of DNA. We also present Minerva, an algorithm which approximately solves the barcode deconvolution problem for metagenomic data. This codebase implements Minerva.

Minerva: An Alignment and Reference Free Approach to Deconvolve Linked-Reads for Metagenomics

Installation

From PyPI

pip install minerva_deconvolve

From source

git clone <url>   
cd minerva_barcode_deconvolution
python setup.py install

Deconvolving Reads

Use the following command to run barcode deconvolution. <fastq> should be an interleaved FASTQ file in which each read carries a BX tag designating its barcode (this is the default output of longranger basic).

cat <fastq> | minerva_deconvolve -k 20 -w 40 -d 8 -a 20 --remove-stopwords --eps 0.51 > ebc_assignments.tsv
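Before running a long deconvolution job, it can help to check that the input really carries BX tags. The following is a minimal sketch, not part of Minerva: it assumes four-line FASTQ records and that the barcode appears in the header as a whitespace-separated `BX:Z:...` field. The function names `read_barcode` and `group_reads_by_barcode` are hypothetical.

```python
import io
from collections import defaultdict


def read_barcode(header):
    """Return the BX tag (e.g. 'BX:Z:ACGT-1') from a FASTQ header, or None."""
    # Assumption: tags follow the read id as whitespace-separated fields.
    for field in header.split()[1:]:
        if field.startswith("BX:Z:"):
            return field
    return None


def group_reads_by_barcode(handle):
    """Map each barcode to the set of read ids that carry it.

    Assumes well-formed four-line FASTQ records. In an interleaved file
    both mates share a read id, so each pair is counted once.
    """
    groups = defaultdict(set)
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:  # end of file
            break
        header = record[0].rstrip("\n")
        read_id = header.split()[0][1:]  # drop the leading '@'
        groups[read_barcode(header)].add(read_id)
    return groups


# Tiny in-memory example: one barcoded pair and one read without a BX tag.
example = io.StringIO(
    "@r1 BX:Z:AAA-1\nACGT\n+\nIIII\n"
    "@r1 BX:Z:AAA-1\nTTTT\n+\nIIII\n"
    "@r2\nGGGG\n+\nIIII\n"
)
groups = group_reads_by_barcode(example)
for barcode, read_ids in groups.items():
    print(barcode, len(read_ids))
```

Reads grouped under `None` have no BX tag in their header under the assumptions above.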

For more options, run:

minerva_deconvolve --help

Output

Minerva assigns the reads within each barcode to clusters, called deconvolved barcodes. The minerva_deconvolve command outputs a TSV file with three columns. In the sample output below these appear in the order barcode, read id, cluster id. The deconvolved barcode for a read is the tuple (barcode, cluster id).

$ head <minerva_output_file>
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1207:20627:25951 0
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1113:11082:83578 0
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:2206:4393:100450 0
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:2:1111:2014:28730  1
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:2216:17277:16384 2
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1201:19163:82220 2
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1202:16780:78102 0
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1210:7460:13722  2
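The assignment file can be loaded into a lookup from read id to deconvolved barcode. This is a rough sketch rather than Minerva's own code: it assumes the column order shown in the sample output (barcode, read id, cluster id), splits on any whitespace, and keeps the cluster id as a string. The function name `load_deconvolved_barcodes` is hypothetical.

```python
import io


def load_deconvolved_barcodes(handle):
    """Map read id -> (barcode, cluster id) from an assignment file.

    Assumes three whitespace-separated columns in the order
    barcode, read id, cluster id. Other lines are skipped.
    """
    assignments = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 3:
            continue
        barcode, read_id, cluster_id = fields
        assignments[read_id] = (barcode, cluster_id)
    return assignments


# Two rows taken from the sample output above.
sample = io.StringIO(
    "BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1207:20627:25951 0\n"
    "BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:2:1111:2014:28730  1\n"
)
assignments = load_deconvolved_barcodes(sample)
for read_id, deconvolved in assignments.items():
    print(read_id, deconvolved)
```

With a real file, pass an open file handle, for example `open("ebc_assignments.tsv")`, in place of the in-memory sample.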

To add enhanced barcodes to your FASTQ file, run the following:

minerva_deconvolve_fastq <bc_assignment_file> <fastq_file> - > output.fq

Performance

This is a demonstration program and is not intended to be performant. Runtimes over 10 hours are common even on small datasets. RAM usage is typically 50-100 GB.

Datasets

The datasets used in the paper may be downloaded from AWS.

Credits

This algorithm was developed and tested with help from Dmitrii Meleshko, Daniela Bezdan, Chris Mason, and Iman Hajirasouliha.

This package is written and maintained by David C. Danko.
