---
title: "sourmash v4: A multitool to quickly search, compare, and analyze genomic and metagenomic data sets"
tags:
  - FracMinHash
  - MinHash
  - k-mers
  - Python
  - Rust
authors:
  - name: Luiz Irber
    orcid: 0000-0003-4371-9659
    equal-contrib: true
    affiliation: 1
  - name: N. Tessa Pierce-Ward
    orcid: 0000-0002-2942-5331
    equal-contrib: true
    affiliation: 1
  - name: Mohamed Abuelanin
    orcid: 0000-0002-3419-4785
    affiliation: 1
  - name: Harriet Alexander
    orcid: 0000-0003-1308-8008
    affiliation: 2
  - name: Abhishek Anant
    orcid: 0000-0002-5751-2010
    affiliation: 9
  - name: Keya Barve
    orcid: 0000-0003-3241-2117
    affiliation: 1
  - name: Colton Baumler
    orcid: 0000-0002-5926-7792
    affiliation: 1
  - name: Olga Botvinnik
    orcid: 0000-0003-4412-7970
    affiliation: 3
  - name: Phillip Brooks
    orcid: 0000-0003-3987-244X
    affiliation: 1
  - name: Daniel Dsouza
    orcid: 0000-0001-7843-8596
    affiliation: 9
  - name: Laurent Gautier
    orcid: 0000-0003-0638-3391
    affiliation: 9
  - name: Mahmudur Rahman Hera
    orcid: 0000-0002-5992-9012
    affiliation: 4
  - name: Hannah Eve Houts
    orcid: 0000-0002-7954-4793
    affiliation: 1
  - name: Lisa K. Johnson
    orcid: 0000-0002-3600-7218
    affiliation: 1
  - name: Fabian Klötzl
    orcid: 0000-0002-6930-0592
    affiliation: 5
  - name: David Koslicki
    orcid: 0000-0002-0640-954X
    affiliation: 4
  - name: Marisa Lim
    orcid: 0000-0003-2097-8818
    affiliation: 1
  - name: Ricky Lim
    orcid: 0000-0003-1313-7076
    affiliation: 9
  - name: Bradley Nelson
    orcid: 0009-0001-1553-932X
    affiliation: 9
  - name: Ivan Ogasawara
    orcid: 0000-0001-5049-4289
    affiliation: 9
  - name: Taylor Reiter
    orcid: 0000-0002-7388-421X
    affiliation: 1
  - name: Camille Scott
    orcid: 0000-0001-8822-8779
    affiliation: 1
  - name: Andreas Sjödin
    orcid: 0000-0001-5350-4219
    affiliation: 6
  - name: Daniel Standage
    orcid: 0000-0003-0342-8531
    affiliation: 7
  - name: S. Joshua Swamidass
    orcid: 0000-0003-2191-0778
    affiliation: 8
  - name: Connor Tiffany
    orcid: 0000-0001-8188-7720
    affiliation: 9
  - name: Pranathi Vemuri
    orcid: 0000-0002-5748-9594
    affiliation: 3
  - name: Erik Young
    orcid: 0000-0002-9195-9801
    affiliation: 1
  - name: C. Titus Brown
    orcid: 0000-0001-6001-2677
    corresponding: true
    affiliation: 1
affiliations:
  - name: University of California Davis, Davis, CA, United States of America
    index: 1
  - name: Woods Hole Oceanographic Institution, Woods Hole, MA, United States of America
    index: 2
  - name: Chan-Zuckerberg Biohub, San Francisco, CA, United States of America
    index: 3
  - name: Pennsylvania State University, University Park, PA, United States of America
    index: 4
  - name: Max Planck Institute for Evolutionary Biology, Plön, Germany
    index: 5
  - name: Swedish Defence Research Agency (FOI), Stockholm, Sweden
    index: 6
  - name: National Bioforensic Analysis Center, Fort Detrick, MD, United States of America
    index: 7
  - name: Washington University in St Louis, St Louis, MO, United States of America
    index: 8
  - name: No affiliation
    index: 9
date: 31 Jan 2024
bibliography: paper.bib
---

Summary

sourmash is a command line tool and Python library for sketching collections of DNA, RNA, and amino acid k-mers for biological sequence search, comparison, and analysis [@Pierce:2019]. sourmash's FracMinHash sketching supports fast and accurate sequence comparisons between datasets of different sizes [@gather], including taxonomic profiling [@portik2022evaluation], functional profiling [@hera2023fast], and petabase-scale sequence search [@branchwater]. From release 4.x, sourmash is built on top of Rust and provides an experimental Rust interface.

FracMinHash sketching is a lossy compression approach that represents data sets using a "fractional" sketch containing approximately $1/S$ of the original k-mers, where $S$ is the scaling factor chosen when sketching. Like other sequence sketching techniques (e.g., MinHash [@Ondov:2015]), FracMinHash provides a lightweight way to store representations of large DNA or RNA sequence collections for comparison and search. Sketches can be used to identify samples, find similar samples, identify data sets with shared sequences, and build phylogenetic trees. FracMinHash sketching supports estimation of overlap, bidirectional containment, and Jaccard similarity between data sets and is accurate even for data sets of very different sizes.
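To make the idea concrete, the following is a minimal pure-Python sketch of fractional sketching, not sourmash's implementation. It assumes a 64-bit hash space and uses BLAKE2b from the standard library as a stand-in hash function. The names `frac_sketch`, `jaccard`, and `containment` are hypothetical helpers introduced for illustration. It also skips details a real tool would need, such as treating a k-mer and its reverse complement as the same k-mer.

```python
import hashlib

MAX_HASH = 2**64  # size of the assumed 64-bit hash space


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, chosen for this sketch)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_sketch(seq, k=21, scaled=10):
    """Keep only hashes in the lowest 1/scaled of the hash space.

    Because the cutoff depends only on the hash value, the same k-mer is
    kept or dropped consistently in every data set.
    """
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Estimated Jaccard similarity: shared hashes over all hashes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Estimated fraction of a's k-mers that also occur in b."""
    return len(a & b) / len(a) if a else 0.0
```

In this sketch, `containment(a, b)` and `containment(b, a)` generally differ, which is what "bidirectional containment" refers to: a small genome can be almost fully contained in a large metagenome while the metagenome is barely contained in the genome.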

Since sourmash v1 was released in 2016 [@Brown:2016], sourmash has expanded to support new database types and many more command line functions. In particular, sourmash now has robust support for both Jaccard similarity and containment calculations, which enables analysis and comparison of data sets of different sizes, including large metagenomic samples. As of v4.4, sourmash can convert these Jaccard and containment estimates to estimated Average Nucleotide Identity (ANI) values, which can provide improved biological context to sketch comparisons [@hera2023deriving].
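The intuition behind such a conversion can be illustrated with a simple point estimate. Assume each base mutates independently: a k-mer then survives unchanged with probability roughly $\mathrm{ANI}^k$, so containment $C \approx \mathrm{ANI}^k$. The functions below are a hypothetical sketch under that assumption. They are not sourmash's API, and the published method [@hera2023deriving] may differ in detail.

```python
def containment_to_ani(containment, ksize):
    """Point estimate of ANI from containment, assuming C ~= ANI ** ksize."""
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    return containment ** (1.0 / ksize)


def jaccard_to_ani(jaccard, ksize):
    """Point estimate of ANI from Jaccard similarity.

    Assumes both sketches are about the same size, so that
    containment ~= 2J / (1 + J), then applies the containment formula.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be between 0 and 1")
    return containment_to_ani(2.0 * jaccard / (1.0 + jaccard), ksize)


# Half of the 31-mers shared still implies closely related sequences.
print(round(containment_to_ani(0.5, 31), 3))  # about 0.978
```

The example shows why ANI can be easier to interpret than raw containment: a single substitution disrupts up to `ksize` overlapping k-mers, so a modest drop in identity produces a large drop in k-mer containment.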

Statement of Need

Large collections of genomes, transcriptomes, and raw sequencing data sets are readily available in biology, and the field needs lightweight computational methods for searching and summarizing the content of both public and private collections. sourmash provides a flexible set of programmatic tools for this purpose, together with a robust and well-tested command-line interface. It has been used in over 350 publications (based on citations of @Brown:2016 and @Pierce:2019) and it continues to expand in functionality.

Acknowledgements

This work was funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative [GBMF4551 to CTB]. It is also funded in part by the National Science Foundation [#2018522 to CTB] and PIG-PARADIGM (Preventing Infection in the Gut of developing Piglets – and thus Antimicrobial Resistance – by disentAngling the interface of DIet, the host and the Gastrointestinal Microbiome) from the Novo Nordisk Foundation to CTB.

Notice: This manuscript has been authored by BNBI under Contract No. HSHQDC-15-C-00064 with the DHS. The US Government retains and the publisher, by accepting the article for publication, acknowledges that the USG retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for USG purposes. Views and conclusions contained herein are those of the authors and should not be interpreted to represent policies, expressed or implied, of the DHS.

References