Commit

add info on updated csv format, ftp_path
bluegenes committed May 13, 2024
1 parent 9476bc8 commit 4199ce0
Showing 1 changed file with 19 additions and 5 deletions.
24 changes: 19 additions & 5 deletions README.md
@@ -16,15 +16,29 @@ pip install sourmash_plugin_directsketch
```

## Usage

### Create an input file

First, create a file, e.g. `acc.csv`, with GenBank identifiers and sketch names.
```
ident,name
GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-45
GCA_000175555.1,GCA_000175555.1 ACUK01000506.1 Saccharolobus solfataricus 98/2
accession,name,ftp_path
GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-45,
GCA_000175555.1,GCA_000175555.1 ACUK01000506.1 Saccharolobus solfataricus 98/2,
```
> Extra columns are ok, as long as the first two columns contain the identifier and sketch name
> All three columns, `accession`, `name`, and `ftp_path`, must be present, but the `ftp_path` column may be left empty.
> No additional columns may be present.
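
As an illustration of the three-column format, here is a minimal Python sketch that writes such a file and checks its header. The helper names `write_acc_csv` and `check_acc_csv` are hypothetical and not part of the plugin. The check only compares column names, since whether `gbsketch` also requires this column order is not stated here.

```python
import csv
import io

# Column names required by the updated CSV format.
REQUIRED_COLUMNS = ["accession", "name", "ftp_path"]


def write_acc_csv(rows, handle):
    """Write (accession, name, ftp_path) rows; ftp_path may be an empty string."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(REQUIRED_COLUMNS)
    for accession, name, ftp_path in rows:
        writer.writerow([accession, name, ftp_path])


def check_acc_csv(handle):
    """Return the data rows; raise ValueError unless exactly the required columns are present."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    # Assumption: only the set of column names is checked, not their order.
    if sorted(fieldnames) != sorted(REQUIRED_COLUMNS):
        raise ValueError(f"expected columns {REQUIRED_COLUMNS}, got {fieldnames}")
    return list(reader)


rows = [
    ("GCA_000961135.2", "GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-45", ""),
    ("GCA_000175555.1", "GCA_000175555.1 ACUK01000506.1 Saccharolobus solfataricus 98/2", ""),
]

# An in-memory buffer stands in for a real file such as acc.csv.
buffer = io.StringIO()
write_acc_csv(rows, buffer)
buffer.seek(0)
records = check_acc_csv(buffer)
```

To produce a real file, pass `open("acc.csv", "w", newline="")` as the handle instead of the in-memory buffer.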

Run:
#### What is ftp_path?

If you do not provide an `ftp_path`, `gbsketch` will use the accession to find the `ftp_path` for you.

If you choose to provide it, `ftp_path` must match the value in the `ftp_path` column of NCBI's assembly summary files.

For reference:
- example `ftp_path`: `https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/036/600/915/GCA_036600915.1_ASM3660091v1`
- bacteria assembly summary file: `https://ftp.ncbi.nih.gov/genomes/genbank/bacteria/assembly_summary.txt`
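
If you want to fill in `ftp_path` yourself, one possible approach is to look each accession up in a downloaded assembly summary file. The sketch below is illustrative only: the function name `ftp_paths_from_summary` is hypothetical, and it assumes the file is tab-separated with a `#`-prefixed header line naming the `assembly_accession` and `ftp_path` columns. The sample data is a heavily trimmed, made-up excerpt built from the example `ftp_path` above.

```python
def ftp_paths_from_summary(lines, wanted):
    """Map each wanted accession to its ftp_path, given assembly summary lines."""
    header = None
    found = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Assumption: the column header is a '#'-prefixed, tab-separated line.
            fields = line.lstrip("# ").split("\t")
            if "assembly_accession" in fields and "ftp_path" in fields:
                header = fields
            continue
        if header is None or not line:
            continue
        record = dict(zip(header, line.split("\t")))
        if record.get("assembly_accession") in wanted:
            found[record["assembly_accession"]] = record.get("ftp_path", "")
    return found


# Hypothetical excerpt; a real summary file has many more columns and rows.
sample = (
    "# trimmed example for illustration\n"
    "#assembly_accession\tasm_name\tftp_path\n"
    "GCA_036600915.1\tASM3660091v1\t"
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/036/600/915/GCA_036600915.1_ASM3660091v1\n"
)

paths = ftp_paths_from_summary(sample.splitlines(), {"GCA_036600915.1"})
```

Looking columns up by name rather than by position keeps the sketch independent of the exact column order in the summary file.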

### Run:

To run the test accession file at `tests/test-data/acc.csv`, run:
```