Merge pull request #34 from sourmash-bio/urlsketch
Add `urlsketch` for downloading and sketching from any URL, rather than only the GenBank assembly datasets (`gbsketch`).

Notes:
- changes failure output format slightly! the new header is: `accession,name,moltype,md5sum,download_filename,url`, which matches the `urlsketch` input format.

- fixes #20
bluegenes authored May 21, 2024
2 parents 4fce83f + 1298edb commit c93112e
Showing 9 changed files with 835 additions and 55 deletions.
76 changes: 67 additions & 9 deletions README.md
@@ -4,19 +4,25 @@
[![DOI](https://zenodo.org/badge/792101561.svg)](https://zenodo.org/doi/10.5281/zenodo.11165725)


tl;dr - download and sketch NCBI Assembly Datasets by accession
tl;dr - download and sketch data directly

## About

This plugin is an attempt to improve database generation by downloading assemblies, checking md5sum, and sketching to a sourmash zipfile. FASTA files can also be saved if desired. It's quite fast, but still very much at alpha level. Here be dragons.
Commands:

- `gbsketch` - download and sketch NCBI Assembly Datasets by accession
- `urlsketch` - download and sketch directly from a url

This plugin is an attempt to improve sourmash database generation by downloading files, checking md5sum if provided or accessible, and sketching to a sourmash zipfile. FASTA files can also be saved if desired. It's quite fast, but still very much at alpha level. Here be dragons.
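
The md5sum check mentioned above can be illustrated with a minimal Python sketch. This is not the plugin's own code; the helper names `file_md5` and `md5_matches` are hypothetical, and treating an empty expected value as "nothing to check" is an assumption based on the md5sum being optional.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large FASTA downloads are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_md5):
    """Compare a file's md5 to an expected value; skip the check if none is given."""
    if not expected_md5:
        # Assumption: no expected md5sum means there is nothing to verify.
        return True
    return file_md5(path) == expected_md5.strip().lower()
```

The digest is computed over the file bytes as downloaded (for example the `.gz` file itself), not over the decompressed sequence.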

## Installation

```
pip install sourmash_plugin_directsketch
```

## Usage
## `gbsketch`
download and sketch NCBI Assembly Datasets by accession

### Create an input file

@@ -43,15 +49,13 @@ For reference:

To run the test accession file at `tests/test-data/acc.csv`, run:
```
sourmash scripts gbsketch tests/test-data/acc.csv -o test.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
sourmash scripts gbsketch tests/test-data/acc.csv -o test-gbsketch.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
```

Full Usage:

```
usage: gbsketch [-h] [-q] [-d] -o OUTPUT [-f FASTAS] [-k] [--download-only] [--failed FAILED] [-p PARAM_STRING] [-c CORES]
[-r RETRY_TIMES] [-g | -m]
input_csv
usage: gbsketch [-h] [-q] [-d] [-o OUTPUT] [-f FASTAS] [-k] [--download-only] [--failed FAILED] [-p PARAM_STRING] [-c CORES] [-r RETRY_TIMES] [-g | -m] input_csv
download and sketch GenBank assembly datasets
@@ -66,7 +70,7 @@ options:
output zip file for the signatures
-f FASTAS, --fastas FASTAS
Write fastas here
-k, --keep-fastas write FASTA files in addition to sketching. Default: do not write FASTA files
-k, --keep-fasta write FASTA files in addition to sketching. Default: do not write FASTA files
--download-only just download genomes; do not sketch
--failed FAILED csv of failed accessions and download links (should be mostly protein).
-p PARAM_STRING, --param-string PARAM_STRING
@@ -76,7 +80,61 @@ options:
-r RETRY_TIMES, --retry-times RETRY_TIMES
number of times to retry failed downloads
-g, --genomes-only just download and sketch genome (DNA) files
-m, --proteomes-only just download and sketch proteome (protein) files
```

## `urlsketch`
download and sketch directly from a url
### Create an input file

First, create a file, e.g. `acc-url.csv`, with identifiers, sketch names, and other required info.
```
accession,name,moltype,md5sum,download_filename,url
GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,dna,47b9fb20c51f0552b87db5d44d5d4566,GCA_000961135.2_genomic.urlsketch.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/GCA_000961135.2_ASM96113v2_genomic.fna.gz
GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,protein,fb7920fb8f3cf5d6ab9b6b754a5976a4,GCA_000961135.2_protein.urlsketch.faa.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/GCA_000961135.2_ASM96113v2_protein.faa.gz
GCA_000175535.1,GCA_000175535.1 Chlamydia muridarum MopnTet14 (agent of mouse pneumonitis) strain=MopnTet14,dna,a1a8f1c6dc56999c73fe298871c963d1,GCA_000175535.1_genomic.urlsketch.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/175/535/GCA_000175535.1_ASM17553v1/GCA_000175535.1_ASM17553v1_genomic.fna.gz
```
> Six columns must be present:
> - `accession` - an accession or unique identifier. Ideally no spaces.
> - `name` - full name for the sketch.
> - `moltype` - is the file 'dna' or 'protein'?
> - `md5sum` - expected md5sum (optional, will be checked after download if provided)
> - `download_filename` - filename for FASTA download. Required if `--keep-fasta` is set, but useful for signatures, too (saved in sig data).
> - `url` - direct link for the file
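
As a rough illustration of the six-column layout, the following Python sketch writes an input file and runs a few basic checks first. The helpers `validate_row` and `write_urlsketch_csv` are hypothetical and not part of the plugin, and the checks are assumptions drawn from the column descriptions above (for instance, `md5sum` may be left empty).

```python
import csv
import re

# Column order taken from the example header above.
URLSKETCH_COLUMNS = ["accession", "name", "moltype", "md5sum", "download_filename", "url"]
MD5_RE = re.compile(r"^[0-9a-f]{32}$")


def validate_row(row):
    """Return a list of problems found in one input row (hypothetical helper)."""
    missing = [col for col in URLSKETCH_COLUMNS if col not in row]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    if not row["accession"]:
        problems.append("accession is empty")
    if row["moltype"] not in ("dna", "protein"):
        problems.append(f"moltype must be 'dna' or 'protein', got {row['moltype']!r}")
    # md5sum is optional; only check its shape when a value is given.
    if row["md5sum"] and not MD5_RE.match(row["md5sum"].lower()):
        problems.append("md5sum is not a 32-character hex digest")
    if not row["url"]:
        problems.append("url is empty")
    return problems


def write_urlsketch_csv(rows, handle):
    """Write rows (dicts keyed by column name) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=URLSKETCH_COLUMNS)
    writer.writeheader()
    for row in rows:
        problems = validate_row(row)
        if problems:
            raise ValueError(f"{row.get('accession', '?')}: {'; '.join(problems)}")
        writer.writerow(row)
```

Open the output with `open("acc-url.csv", "w", newline="")` so the `csv` module controls line endings.
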
### Run:

To run the test accession file at `tests/test-data/acc-url.csv`, run:
```
sourmash scripts urlsketch tests/test-data/acc-url.csv -o test-urlsketch.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
```
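
Per the commit notes, the `--failed` CSV now uses the same header as the `urlsketch` input. A small Python sketch for inspecting such a file might look like the following; `read_failed_csv` and `count_by_moltype` are hypothetical helpers, and the assumption that the file contains exactly this header comes from the notes rather than from testing the plugin.

```python
import csv

# Header quoted in the commit notes; assumed to be the first line of the failure CSV.
EXPECTED_HEADER = ["accession", "name", "moltype", "md5sum", "download_filename", "url"]


def read_failed_csv(handle):
    """Read a failure CSV from an open text handle and check its header."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


def count_by_moltype(rows):
    """Count failed entries per moltype, e.g. to see whether most are protein."""
    counts = {}
    for row in rows:
        counts[row["moltype"]] = counts.get(row["moltype"], 0) + 1
    return counts
```

For example, `count_by_moltype(read_failed_csv(open("test.failed.csv", newline="")))` would summarize the failures written by the command above.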

Full Usage:
```
usage: urlsketch [-h] [-q] [-d] [-o OUTPUT] [-f FASTAS] [-k] [--download-only] [--failed FAILED] [-p PARAM_STRING] [-c CORES] [-r RETRY_TIMES] input_csv
download and sketch GenBank assembly datasets
positional arguments:
input_csv a txt file or csv file containing accessions in the first column
options:
-h, --help show this help message and exit
-q, --quiet suppress non-error output
-d, --debug provide debugging output
-o OUTPUT, --output OUTPUT
output zip file for the signatures
-f FASTAS, --fastas FASTAS
Write fastas here
-k, --keep-fasta, --keep-fastq
write FASTA/Q files in addition to sketching. Default: do not write FASTA files
--download-only just download genomes; do not sketch
--failed FAILED csv of failed accessions and download links (should be mostly protein).
-p PARAM_STRING, --param-string PARAM_STRING
parameter string for sketching (default: k=31,scaled=1000)
-c CORES, --cores CORES
number of cores to use (default is all available)
-r RETRY_TIMES, --retry-times RETRY_TIMES
number of times to retry failed downloads
```

## Code of Conduct
1 change: 1 addition & 0 deletions pyproject.toml
@@ -20,6 +20,7 @@ build-backend = "maturin"

[project.entry-points."sourmash.cli_script"]
gbsketch = "sourmash_plugin_directsketch:Download_and_Sketch_Assemblies"
urlsketch = "sourmash_plugin_directsketch:Download_and_Sketch_Url"

[project.optional-dependencies]
test = [