title |
tags |
authors |
affiliations |
date |
bibliography |
sourmash v4: A multitool to quickly search, compare, and analyze genomic and metagenomic data sets |
FracMinHash |
MinHash |
k-mers |
Python |
Rust |
|
name |
orcid |
equal-contrib |
affiliation |
Luiz Irber |
0000-0003-4371-9659 |
true |
1 |
|
name |
orcid |
equal-contrib |
affiliation |
N. Tessa Pierce-Ward |
0000-0002-2942-5331 |
true |
1 |
|
name |
orcid |
affiliation |
Mohamed Abuelanin |
0000-0002-3419-4785 |
1 |
|
name |
orcid |
affiliation |
Harriet Alexander |
0000-0003-1308-8008 |
2 |
|
name |
orcid |
affiliation |
Abhishek Anant |
0000-0002-5751-2010 |
9 |
|
name |
orcid |
affiliation |
Keya Barve |
0000-0003-3241-2117 |
1 |
|
name |
orcid |
affiliation |
Colton Baumler |
0000-0002-5926-7792 |
1 |
|
name |
orcid |
affiliation |
Olga Botvinnik |
0000-0003-4412-7970 |
3 |
|
name |
orcid |
affiliation |
Phillip Brooks |
0000-0003-3987-244X |
1 |
|
name |
orcid |
affiliation |
Daniel Dsouza |
0000-0001-7843-8596 |
9 |
|
name |
orcid |
affiliation |
Laurent Gautier |
0000-0003-0638-3391 |
9 |
|
name |
orcid |
affiliation |
Mahmudur Rahman Hera |
0000-0002-5992-9012 |
4 |
|
name |
orcid |
affiliation |
Hannah Eve Houts |
0000-0002-7954-4793 |
1 |
|
name |
orcid |
affiliation |
Lisa K. Johnson |
0000-0002-3600-7218 |
1 |
|
name |
orcid |
affiliation |
Fabian Klötzl |
0000-0002-6930-0592 |
5 |
|
name |
orcid |
affiliation |
David Koslicki |
0000-0002-0640-954X |
4 |
|
name |
orcid |
affiliation |
Marisa Lim |
0000-0003-2097-8818 |
1 |
|
name |
orcid |
affiliation |
Ricky Lim |
0000-0003-1313-7076 |
9 |
|
name |
orcid |
affiliation |
Bradley Nelson |
0009-0001-1553-932X |
9 |
|
name |
orcid |
affiliation |
Ivan Ogasawara |
0000-0001-5049-4289 |
9 |
|
name |
orcid |
affiliation |
Taylor Reiter |
0000-0002-7388-421X |
1 |
|
name |
orcid |
affiliation |
Camille Scott |
0000-0001-8822-8779 |
1 |
|
name |
orcid |
affiliation |
Andreas Sjödin |
0000-0001-5350-4219 |
6 |
|
name |
orcid |
affiliation |
Daniel Standage |
0000-0003-0342-8531 |
7 |
|
name |
orcid |
affiliation |
S. Joshua Swamidass |
0000-0003-2191-0778 |
8 |
|
name |
orcid |
affiliation |
Connor Tiffany |
0000-0001-8188-7720 |
9 |
|
name |
orcid |
affiliation |
Pranathi Vemuri |
0000-0002-5748-9594 |
3 |
|
name |
orcid |
affiliation |
Erik Young |
0000-0002-9195-9801 |
1 |
|
name |
orcid |
corresponding |
affiliation |
C. Titus Brown |
0000-0001-6001-2677 |
true |
1 |
|
|
name |
index |
University of California Davis, Davis, CA, United States of America |
1 |
|
name |
index |
Woods Hole Oceanographic Institution, Woods Hole, MA, United States of America |
2 |
|
name |
index |
Chan-Zuckerberg Biohub, San Francisco, CA, United States of America |
3 |
|
name |
index |
Pennsylvania State University, University Park, PA, United States of America |
4 |
|
name |
index |
Max Planck Institute for Evolutionary Biology, Plön, Germany |
5 |
|
name |
index |
Swedish Defence Research Agency (FOI), Stockholm, Sweden |
6 |
|
name |
index |
National Bioforensic Analysis Center, Fort Detrick, MD, United States of America |
7 |
|
name |
index |
Washington University in St Louis, St Louis, MO, United States of America |
8 |
|
name |
index |
No affiliation |
9 |
|
|
31 Jan 2024 |
paper.bib |
sourmash is a command line tool and Python library for sketching collections
of DNA, RNA, and amino acid k-mers for biological sequence search, comparison,
and analysis [@Pierce:2019]. sourmash's FracMinHash sketching supports fast and
accurate sequence comparisons between data sets of different sizes [@gather],
including taxonomic profiling [@portik2022evaluation], functional profiling
[@hera2023fast], and petabase-scale sequence search [@branchwater]. From
release 4.x, sourmash is built on top of Rust and provides an experimental
Rust interface.
FracMinHash sketching is a lossy compression approach that represents data
sets using a "fractional" sketch containing $1/S$ of the original k-mers. Like
other sequence sketching techniques (e.g., MinHash [@Ondov:2015]), FracMinHash
provides a lightweight way to store representations of large DNA or RNA
sequence collections for comparison and search. Sketches can be used to
identify samples, find similar samples, identify data sets with shared
sequences, and build phylogenetic trees. FracMinHash sketching supports
estimation of overlap, bidirectional containment, and Jaccard similarity
between data sets and is accurate even for data sets of very different sizes.
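
The idea behind fractional sketching can be illustrated with a minimal Python sketch. This is not sourmash's implementation: the hash function (BLAKE2b truncated to 64 bits), the helper names, and the `scaled` parameter (playing the role of $S$ above) are assumptions chosen for illustration, and the example ignores details such as reverse-complement handling. Each k-mer is hashed into a 64-bit space and kept only if its hash falls below $2^{64}/S$, so roughly $1/S$ of the distinct k-mers are retained.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k=21, scaled=10):
    """Keep only hashes below MAX_HASH / scaled (about 1/scaled of k-mers)."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Estimated Jaccard similarity: shared hashes over all hashes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Estimated fraction of sketch a that is also present in sketch b."""
    return len(a & b) / len(a) if a else 0.0
```

Because the same threshold is applied to every data set, the sketch of a subsequence is always a subset of the sketch of the full sequence. This is why containment remains meaningful when the two data sets differ greatly in size, and why `containment(a, b)` and `containment(b, a)` can differ.
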
Since sourmash v1 was released in 2016 [@Brown:2016], sourmash has expanded
to support new database types and many more command line functions.
In particular, sourmash now has robust support for both Jaccard similarity
and containment calculations, which enables analysis and comparison of data
sets of different sizes, including large metagenomic samples. As of v4.4,
sourmash can convert these to estimated Average Nucleotide Identity (ANI)
values, which can provide improved biological context to sketch comparisons
[@hera2023deriving].
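
As a rough illustration of how a containment value relates to ANI, the following sketch assumes a simple model in which each base mutates independently, so that a k-mer of length $k$ survives unchanged with probability $\mathrm{ANI}^k$ and containment is approximately $\mathrm{ANI}^k$. The function name is hypothetical, and this point estimate is only a simplified reading of the idea; it is not presented as the exact calculation performed by sourmash, and it omits the confidence intervals derived in the cited work.

```python
def containment_to_ani(containment, ksize):
    """Point estimate of ANI from containment, assuming containment ~ ANI**ksize."""
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    return containment ** (1.0 / ksize)
```

Under this model, small differences in ANI produce large differences in containment at typical k-mer sizes, which is one reason an ANI value can be easier to interpret biologically than the raw containment.
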
Large collections of genomes, transcriptomes, and raw sequencing data sets are
readily available in biology, and the field needs lightweight computational
methods for searching and summarizing the content of both public and private
collections. sourmash provides a flexible set of programmatic tools
for this purpose, together with a robust and well-tested command-line
interface. It has been used in over 350 publications (based on citations of
@Brown:2016 and @Pierce:2019), and it continues to expand in functionality.
This work was funded in part by the Gordon and Betty Moore Foundation’s
Data-Driven Discovery Initiative [GBMF4551 to CTB]. It is also funded in
part by the National Science Foundation [#2018522 to CTB] and PIG-PARADIGM
(Preventing Infection in the Gut of developing Piglets – and thus Antimicrobial
Resistance – by disentAngling the interface of DIet, the host and the
Gastrointestinal Microbiome) from the Novo Nordisk Foundation to CTB.
Notice: This manuscript has been authored by BNBI under Contract
No. HSHQDC-15-C-00064 with the DHS. The US Government retains and the
publisher, by accepting the article for publication, acknowledges that the USG
retains a non-exclusive, paid-up, irrevocable, world-wide license to publish
or reproduce the published form of this manuscript, or allow others to do
so, for USG purposes. Views and conclusions contained herein are those of
the authors and should not be interpreted to represent policies, expressed
or implied, of the DHS.