Leighton Pritchard edited this page Apr 28, 2017 · 12 revisions

This page contains notes on the planned future development of pyani.


Interface

The current interface to pyani is to call either the average_nucleotide_identity.py or genbank_get_genomes_by_taxon.py script with a combination of arguments. For the average_nucleotide_identity.py script in particular, there are arguments that either perform a stage of the overall analysis or prevent a stage from executing. I would like to change this interface to a pyani.py COMMAND OPTIONS structure, similar to git and other tools.

More specifically, I would like to enable operations such as:

  • pyani.py download -t 931 -o my_organism: download all NCBI assemblies under taxon 931 to the directory my_organism
  • pyani.py index my_organism: generate MD5 or other hashes for each genome in the directory my_organism
  • pyani.py anim my_organism -o my_organism_ANIm --scheduler SGE: conduct ANIm analysis on the genomes in the directory my_organism, writing output to my_organism_ANIm
  • pyani.py anib my_organism -o my_organism_ANIb --scheduler SGE: conduct ANIb analysis on the genomes in the directory my_organism, writing output to my_organism_ANIb
  • pyani.py render my_organism_ANIm --gmethod seaborn: draw graphical output for the ANIm analysis in the directory my_organism_ANIm
  • pyani.py classify my_organism_ANIm: conduct classification analysis of ANIm results in the directory my_organism_ANIm
  • pyani.py db --setdb my_db: specify the database (sqlite3?) to hold comparison data; create it if it does not exist
  • pyani.py db --update my_organism_ANIm: update the current comparison database with the results contained in my_organism_ANIm; this might be useful after a partial or failed run.
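The git-style pyani.py COMMAND OPTIONS structure above maps naturally onto argparse subparsers. A minimal sketch follows, covering three of the proposed commands; the option names match the examples above, but the exact argument handling is an assumption:

```python
# Sketch of a git-style "pyani.py COMMAND OPTIONS" interface using argparse
# subparsers. Command and option names follow the examples above; details
# such as defaults are illustrative assumptions.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="pyani.py")
    subparsers = parser.add_subparsers(dest="command")

    # pyani.py download -t 931 -o my_organism
    parser_download = subparsers.add_parser("download")
    parser_download.add_argument("-t", "--taxon", required=True)
    parser_download.add_argument("-o", "--outdir", required=True)

    # pyani.py index my_organism
    parser_index = subparsers.add_parser("index")
    parser_index.add_argument("indir")

    # pyani.py anim my_organism -o my_organism_ANIm --scheduler SGE
    parser_anim = subparsers.add_parser("anim")
    parser_anim.add_argument("indir")
    parser_anim.add_argument("-o", "--outdir", required=True)
    parser_anim.add_argument("--scheduler", default="multiprocessing")

    return parser


# Example: parse one of the command lines given above
args = build_parser().parse_args(
    ["anim", "my_organism", "-o", "my_organism_ANIm", "--scheduler", "SGE"]
)
```

Each subcommand would then dispatch to its own handler function, keyed on args.command.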

Some modifications to the options are also desirable:

  • specify multiple input directories
  • specify multiple class/label files
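Accepting multiple input directories and multiple class/label files could be done with argparse's nargs="+"; the option names in this sketch are illustrative, not settled:

```python
# Sketch of accepting multiple input directories and class/label files
# via nargs="+". The option names here are assumptions.
import argparse

parser = argparse.ArgumentParser(prog="pyani.py")
parser.add_argument("-i", "--indirs", nargs="+",
                    help="one or more input directories")
parser.add_argument("--classes", nargs="+",
                    help="one or more class files")
parser.add_argument("--labels", nargs="+",
                    help="one or more label files")

# Example: two input directories and two class files in one invocation
args = parser.parse_args(["-i", "dir_a", "dir_b",
                          "--classes", "c1.tab", "c2.tab"])
```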

Database Storage

The goal is to store all comparison results in a persistent database, so that incremental additions to existing analyses become easier and partially complete jobs can be resumed.

General Implementation

  • A specific sqlite3 database is designated as 'current' for any analysis (e.g. with pyani.py db --setdb <location>)
  • The default database location could be .pyani/pyanidb in the root directory for the analysis (other configuration/debug information may go into .pyani)
  • The database will recognise an MD5 (or other) hash as representing a unique input genome. This will require all input genomes to be 'indexed'
    • The indexing may be performed during download with pyani download
    • Indexing may be forced with pyani index <directory>
  • Previously-seen genomes will be stored in a table (a separate table of their previously-seen locations will be kept)
  • Comparisons between genome pairs will be recorded in a table, indicating the tool (MUMmer, BLAST+, etc.) and date (which may be used to force a recomparison if requested)
  • For each comparison, we will record in another table the values that are currently recorded in the output .tab files
  • Anticipated tables:
    • genomes: hashes of genome sequences
    • paths: known paths for each hash, keyed by hash from genomes
    • comparisons: pairwise comparisons conducted, keyed jointly by the query and subject genomes from genomes, with a column describing the comparison (and options used)
    • data: pairwise comparison results: identity, coverage, mismatches, etc. - what is currently reported in .tab files
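The anticipated tables could be sketched as an sqlite3 schema along the following lines; the column names and types are assumptions based on the notes above, not a finalised design:

```python
# Sketch of the anticipated sqlite3 schema. Table names follow the notes
# above; column names and types are illustrative assumptions.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS genomes (
    hash TEXT PRIMARY KEY,       -- MD5 (or other) hash of the genome sequence
    description TEXT
);
CREATE TABLE IF NOT EXISTS paths (
    hash TEXT REFERENCES genomes(hash),
    path TEXT,                   -- a location at which this genome has been seen
    PRIMARY KEY (hash, path)
);
CREATE TABLE IF NOT EXISTS comparisons (
    query_hash TEXT REFERENCES genomes(hash),
    subject_hash TEXT REFERENCES genomes(hash),
    tool TEXT,                   -- e.g. MUMmer, BLAST+
    options TEXT,                -- options used for the comparison
    run_date TEXT,               -- may be used to force a recomparison
    PRIMARY KEY (query_hash, subject_hash, tool, options)
);
CREATE TABLE IF NOT EXISTS data (
    query_hash TEXT,             -- refers to a row in comparisons
    subject_hash TEXT,
    identity REAL,               -- values currently reported in .tab files
    coverage REAL,
    mismatches INTEGER
);
"""

conn = sqlite3.connect(":memory:")  # in practice, e.g. .pyani/pyanidb
conn.executescript(SCHEMA)
```

Checking whether a comparison has already been run then becomes a single SELECT against comparisons, keyed on the two genome hashes and the tool/options used.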

This database will allow rapid identification of analyses that have already been performed, avoiding the need to redo comparisons.

It will also provide a persistent record of comparisons that can be accessed for downstream analyses using, e.g., pyani render and a set of genome files (or a list of their hashes?). This will allow ready subsetting of outputs.

Tables