Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add urlsketch command #34

Merged
merged 15 commits into from
May 21, 2024
76 changes: 67 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,19 +4,25 @@
[![DOI](https://zenodo.org/badge/792101561.svg)](https://zenodo.org/doi/10.5281/zenodo.11165725)


tl;dr - download and sketch NCBI Assembly Datasets by accession
tl;dr - download and sketch data directly

## About

This plugin is an attempt to improve database generation by downloading assemblies, checking md5sum, and sketching to a sourmash zipfile. FASTA files can also be saved if desired. It's quite fast, but still very much at alpha level. Here be dragons.
Commands:

- `gbsketch` - download and sketch NCBI Assembly Datasets by accession
- `urlsketch` - download and sketch directly from a url

This plugin is an attempt to improve sourmash database generation by downloading files, checking md5sum if provided or accessible, and sketching to a sourmash zipfile. FASTA files can also be saved if desired. It's quite fast, but still very much at alpha level. Here be dragons.

## Installation

```
pip install sourmash_plugin_directsketch
```

## Usage
## `gbsketch`
download and sketch NCBI Assembly Datasets by accession

### Create an input file

Expand All @@ -43,15 +49,13 @@ For reference:

To run the test accession file at `tests/test-data/acc.csv`, run:
```
sourmash scripts gbsketch tests/test-data/acc.csv -o test.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
sourmash scripts gbsketch tests/test-data/acc.csv -o test-gbsketch.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
```

Full Usage:

```
usage: gbsketch [-h] [-q] [-d] -o OUTPUT [-f FASTAS] [-k] [--download-only] [--failed FAILED] [-p PARAM_STRING] [-c CORES]
[-r RETRY_TIMES] [-g | -m]
input_csv
usage: gbsketch [-h] [-q] [-d] [-o OUTPUT] [-f FASTAS] [-k] [--download-only] [--failed FAILED] [-p PARAM_STRING] [-c CORES] [-r RETRY_TIMES] [-g | -m] input_csv

download and sketch GenBank assembly datasets

Expand All @@ -66,7 +70,7 @@ options:
output zip file for the signatures
-f FASTAS, --fastas FASTAS
Write fastas here
-k, --keep-fastas write FASTA files in addition to sketching. Default: do not write FASTA files
-k, --keep-fasta write FASTA files in addition to sketching. Default: do not write FASTA files
--download-only just download genomes; do not sketch
--failed FAILED csv of failed accessions and download links (should be mostly protein).
-p PARAM_STRING, --param-string PARAM_STRING
Expand All @@ -76,7 +80,61 @@ options:
-r RETRY_TIMES, --retry-times RETRY_TIMES
number of times to retry failed downloads
-g, --genomes-only just download and sketch genome (DNA) files
-m, --proteomes-only just download and sketch proteome (protein) files
```

## `urlsketch`
download and sketch directly from a url
### Create an input file

First, create a file, e.g. `acc-url.csv` with identifiers, sketch names, and other required info.
```
accession,name,moltype,md5sum,download_filename,url
GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,dna,47b9fb20c51f0552b87db5d44d5d4566,GCA_000961135.2_genomic.urlsketch.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/GCA_000961135.2_ASM96113v2_genomic.fna.gz
GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,protein,fb7920fb8f3cf5d6ab9b6b754a5976a4,GCA_000961135.2_protein.urlsketch.faa.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/GCA_000961135.2_ASM96113v2_protein.faa.gz
GCA_000175535.1,GCA_000175535.1 Chlamydia muridarum MopnTet14 (agent of mouse pneumonitis) strain=MopnTet14,dna,a1a8f1c6dc56999c73fe298871c963d1,GCA_000175535.1_genomic.urlsketch.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/175/535/GCA_000175535.1_ASM17553v1/GCA_000175535.1_ASM17553v1_genomic.fna.gz
```
> Six columns must be present:
> - `accession` - an accession or unique identifier. Ideally no spaces.
> - `name` - full name for the sketch.
> - `moltype` - is the file 'dna' or 'protein'?
> - `md5sum` - expected md5sum (optional, will be checked after download if provided)
> - `download_filename` - filename for FASTA download. Required if `--keep-fastas`, but useful for signatures, too (saved in sig data).
> - `url` - direct link for the file

### Run:

To run the test accession file at `tests/test-data/acc-url.csv`, run:
```
sourmash scripts urlsketch tests/test-data/acc-url.csv -o test-urlsketch.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
```

Full Usage:
```
usage: urlsketch [-h] [-q] [-d] [-o OUTPUT] [-f FASTAS] [-k] [--download-only] [--failed FAILED] [-p PARAM_STRING] [-c CORES] [-r RETRY_TIMES] input_csv

download and sketch GenBank assembly datasets

positional arguments:
input_csv a txt file or csv file containing accessions in the first column

options:
-h, --help show this help message and exit
-q, --quiet suppress non-error output
-d, --debug provide debugging output
-o OUTPUT, --output OUTPUT
output zip file for the signatures
-f FASTAS, --fastas FASTAS
Write fastas here
-k, --keep-fasta, --keep-fastq
write FASTA/Q files in addition to sketching. Default: do not write FASTA files
--download-only just download genomes; do not sketch
--failed FAILED csv of failed accessions and download links (should be mostly protein).
-p PARAM_STRING, --param-string PARAM_STRING
parameter string for sketching (default: k=31,scaled=1000)
-c CORES, --cores CORES
number of cores to use (default is all available)
-r RETRY_TIMES, --retry-times RETRY_TIMES
number of times to retry failed downloads
```

## Code of Conduct
Expand Down
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ build-backend = "maturin"

[project.entry-points."sourmash.cli_script"]
gbsketch = "sourmash_plugin_directsketch:Download_and_Sketch_Assemblies"
urlsketch = "sourmash_plugin_directsketch:Download_and_Sketch_Url"

[project.optional-dependencies]
test = [
Expand Down
Loading