bigbed-jaccard

Overview

bigbed-jaccard is a tool designed to quickly estimate pairwise Jaccard similarities between large numbers of bigBed files by using minhashing. If you only want to compare a couple of bed files, it will probably be faster to download them and compute the Jaccard distances directly using bedtools jaccard.

See Details for more information on the algorithm.

Installation

Clone this repo, cd into it, and run cargo build --release. The binary will then be available at target/release/bigbed-jaccard-similarity-matrix

This will require that you have cargo installed, I would recommend using rustup to install the Rust toolchain. Build has been tested on rustc version 1.42.0

Usage

Here, infile is simply a file containing a list of file URLs, one per line, and outfile is the name of the file to write the comparisons to.

$ target/release/bigbed-jaccard-similarity-matrix [infile] [outfile]

There is some utility code in this repository to facilitate data aquisition from the ENCODE portal. Namely, this script will allow you to flatten ENCODE file report TSVs containing JSON objects in the cloud_metadata field, requires pandas>=1.0. You can then convert the output of that script into a file list by extracting the url field and deleting the header.

Benchmarks

Running pairwise comparisons on the 2,677 bigBed files in this file took ~13 minutes on a Mac with 2.6 GHz 6-Core Intel i7 processor and an SSD, resulting in 3,581,827 output comparisons. Memory usage hovered around 1.5 GB.

Details

The minhash algorithm variant used here is bottom-k minhashing, using a single 32-bit hash function described in [1].

Future Work

I'd like to make the underlying code available as a Rust library and also generate Python (possibly NumPy) bindings for the Rust functions.

References

Mikkel Thorup. 2013. Bottom-k and priority sampling, set similarity and subset sums with minimal independence. In Proceedings of the forty-fifth annual ACM symposium on Theory of Computing (STOC ’13). Association for Computing Machinery, New York, NY, USA, 371–380.
Anshumali Shrivastava and Ping Li. 2014. Improved densification of one permutation hashing. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI’14). AUAI Press, Arlington, Virginia, USA, 732–741.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
data		data
notebooks		notebooks
scripts		scripts
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

bigbed-jaccard

Overview

Installation

Usage

Benchmarks

Details

Future Work

References

About

Releases

Packages

Languages

License

paul-sud/bigbed-jaccard

Folders and files

Latest commit

History

Repository files navigation

bigbed-jaccard

Overview

Installation

Usage

Benchmarks

Details

Future Work

References

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages