Minerva Barcoded Read Deconvolution

Emerging linked-read technologies (also known as read-cloud or barcoded short-read technologies) have revived interest in short reads as a viable way to understand large-scale structure in genomes and metagenomes. Linked-read technologies, such as the 10x Chromium system, use a microfluidic system and a specialized set of barcodes to tag short DNA reads sourced from the same long fragment of DNA. The tagged reads are then sequenced on standard short-read platforms.

This approach involves trade-offs. Each long fragment of DNA is only sparsely covered by reads, no information about the ordering of reads from the same fragment is preserved, and each barcode tags reads from roughly 2-20 long fragments of DNA. However, compared to long-read technologies, the cost per base is far lower, far less input DNA is required, and the per-base error rate is that of Illumina short reads.

In the accompanying paper, we formally describe a particular algorithmic issue for linked-read technology: the deconvolution of reads with a single barcode into clusters that represent single long fragments of DNA. We also present Minerva, an algorithm which approximately solves the barcode deconvolution problem for metagenomic data. This codebase implements Minerva.
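To make the problem statement concrete, here is a toy sketch of what deconvolution is meant to recover. It is not Minerva's algorithm: the read IDs, barcodes, and fragment labels are hypothetical, and the fragment label is exactly the information that is unavailable in real data.

```python
from collections import defaultdict

# Hypothetical toy reads: (read id, barcode, true source fragment).
# Real linked-read data carries no fragment label; it appears here
# only to show what an ideal deconvolution would look like.
reads = [
    ("r1", "AAAC", "frag_A"),
    ("r2", "AAAC", "frag_B"),
    ("r3", "AAAC", "frag_A"),
    ("r4", "GGTT", "frag_C"),
    ("r5", "GGTT", "frag_C"),
]


def group_by(records, key):
    """Group read ids by the value of key(record)."""
    groups = defaultdict(set)
    for record in records:
        groups[key(record)].add(record[0])
    return dict(groups)


# What the sequencer gives us: reads grouped by barcode only.
by_barcode = group_by(reads, lambda r: r[1])

# What deconvolution aims for: one cluster per (barcode, fragment).
ideal = group_by(reads, lambda r: (r[1], r[2]))

for key, members in sorted(ideal.items()):
    print(key, sorted(members))
```

In this toy input, barcode `AAAC` mixes reads from two fragments, so an ideal deconvolution splits it into two clusters, while `GGTT` stays whole.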

Minerva: An Alignment and Reference Free Approach to Deconvolve Linked-Reads for Metagenomics

Installation

From PyPI

pip install minerva_deconvolve

From source

git clone <url>
cd minerva_barcode_deconvolution
python setup.py install

Deconvolving Reads

Use the following command to run barcode deconvolution. <fastq> should be an interleaved FASTQ file in which each read carries a BX tag designating its barcode (this is the default output of longranger basic).

cat <fastq> | minerva_deconvolve -k 20 -w 40 -d 8 -a 20 --remove-stopwords --eps 0.51 > ebc_assignments.tsv

For more options, run:

minerva_deconvolve --help
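Because a run can take many hours, it may be worth sanity-checking the input first. The sketch below is not part of Minerva; it assumes the BX tag appears as a whitespace-separated `BX:Z:...` field on each FASTQ header line and that every record spans exactly four lines. The function names are hypothetical.

```python
import io
from collections import Counter
from itertools import islice


def barcode_of(header):
    """Return the BX:Z:... field of a FASTQ header line, or None if absent."""
    for field in header.split()[1:]:
        if field.startswith("BX:Z:"):
            return field
    return None


def count_barcoded_pairs(handle):
    """Count read pairs per barcode in an interleaved FASTQ stream.

    Pairs without a BX tag are counted under the key None.
    """
    counts = Counter()
    while True:
        pair = list(islice(handle, 8))  # two 4-line records form one pair
        if not pair:
            break
        if len(pair) != 8:
            raise ValueError("truncated input: expected whole read pairs")
        first, second = pair[0], pair[4]
        if not (first.startswith("@") and second.startswith("@")):
            raise ValueError("expected '@' at the start of each header line")
        if barcode_of(first) != barcode_of(second):
            raise ValueError("mates carry different barcodes: " + first.strip())
        counts[barcode_of(first)] += 1
    return counts


# Tiny in-memory example with one barcoded pair and one untagged pair.
demo = io.StringIO(
    "@r1 BX:Z:AAAC-1\nACGT\n+\nIIII\n"
    "@r1 BX:Z:AAAC-1\nTTGA\n+\nIIII\n"
    "@r2\nGGCC\n+\nIIII\n"
    "@r2\nCCGG\n+\nIIII\n"
)
demo_counts = count_barcoded_pairs(demo)
print(demo_counts)
```

To check a real file, pass an open file handle (or `sys.stdin`) instead of the in-memory example.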

Output

Within each barcode, Minerva assigns reads to clusters; each cluster is called a deconvolved barcode. The minerva_deconvolve command outputs a TSV file with three columns, which in the sample below are barcode, read ID, and cluster ID. The deconvolved barcode for a read is the tuple (barcode, cluster ID).

$ head <minerva_output_file>
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1207:20627:25951 0
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1113:11082:83578 0
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:2206:4393:100450 0
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:2:1111:2014:28730  1
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:2216:17277:16384 2
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1201:19163:82220 2
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1202:16780:78102 0
BX:Z:GTGCCTTAGTCCGTAT-1 D00547:847:HYHNTBCXX:1:1210:7460:13722  2
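For downstream analysis it can help to group reads by their deconvolved barcode. The following is a minimal sketch, not part of Minerva, that assumes the column order seen in the sample above (barcode, read ID, cluster ID) and whitespace-separated fields; `load_deconvolved_barcodes` is a hypothetical helper name.

```python
import io
from collections import defaultdict


def load_deconvolved_barcodes(handle):
    """Map (barcode, cluster id) -> list of read ids.

    Assumes three whitespace-separated columns per line in the order
    barcode, read id, cluster id. Cluster ids are kept as strings.
    """
    clusters = defaultdict(list)
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            raise ValueError("expected three columns: " + line.strip())
        barcode, read_id, cluster_id = fields[:3]
        clusters[(barcode, cluster_id)].append(read_id)
    return dict(clusters)


# A few rows copied from the sample output above.
demo = io.StringIO(
    "BX:Z:GTGCCTTAGTCCGTAT-1\tD00547:847:HYHNTBCXX:1:1207:20627:25951\t0\n"
    "BX:Z:GTGCCTTAGTCCGTAT-1\tD00547:847:HYHNTBCXX:1:1113:11082:83578\t0\n"
    "BX:Z:GTGCCTTAGTCCGTAT-1\tD00547:847:HYHNTBCXX:2:1111:2014:28730\t1\n"
    "BX:Z:GTGCCTTAGTCCGTAT-1\tD00547:847:HYHNTBCXX:1:2216:17277:16384\t2\n"
)
clusters = load_deconvolved_barcodes(demo)
for key, read_ids in sorted(clusters.items()):
    print(key, len(read_ids))
```

To load a real assignment file, open it and pass the file handle to `load_deconvolved_barcodes` in place of the in-memory example.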

To add enhanced barcodes to your FASTQ file, run the following:

minerva_deconvolve_fastq <bc_assignment_file> <fastq_file> - > output.fq

Performance

This is a demonstration program and is not intended to be performant. Runtimes over 10 hours are common even on small datasets. RAM usage is typically 50-100 GB.

Datasets

The datasets used in the paper may be downloaded from AWS.

Credits

This algorithm was developed and tested with help from Dmitrii Meleshko, Daniela Bezdan, Chris Mason, and Iman Hajirasouliha.

This package is written and maintained by David C. Danko.
