Leighton Pritchard edited this page Apr 28, 2017 · 12 revisions

This page contains notes on the planned future development of pyani.


Interface

The current interface to pyani is to call either the average_nucleotide_identity.py or genbank_get_genomes_by_taxon.py script with a combination of arguments. For the average_nucleotide_identity.py script in particular, there are arguments that either perform a stage of the overall analysis or prevent a stage from executing. I would like to change this interface to a pyani.py COMMAND OPTIONS structure, similar to git and other tools.

More specifically, I would like to enable operations such as:

  • pyani.py download -t 931 -o my_organism: download all NCBI assemblies under taxon 931 to the directory my_organism
  • pyani.py index my_organism: generate MD5 or other hashes for each genome in the directory my_organism
  • pyani.py anim my_organism -o my_organism_ANIm --scheduler SGE: conduct ANIm analysis on the genomes in the directory my_organism, writing output to my_organism_ANIm
  • pyani.py anib my_organism -o my_organism_ANIb --scheduler SGE: conduct ANIb analysis on the genomes in the directory my_organism, writing output to my_organism_ANIb
  • pyani.py render my_organism_ANIm --gmethod seaborn: draw graphical output for the ANIm analysis in the directory my_organism_ANIm
  • pyani.py classify my_organism_ANIm: conduct classification analysis of ANIm results in the directory my_organism_ANIm
  • pyani.py db --setdb my_db: specify the database (sqlite3?) to hold comparison data; create it if it does not exist
  • pyani.py db --update my_organism_ANIm: update the current comparison database with the results contained in my_organism_ANIm; this might be useful after a partial or failed run.
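The git-style pyani.py COMMAND OPTIONS structure above maps naturally onto argparse subparsers. A minimal sketch follows, covering three of the proposed commands; the option names match the examples above, but the exact argument handling is an assumption:

```python
# Sketch of a git-style "pyani.py COMMAND OPTIONS" interface using argparse
# subparsers. Command and option names follow the examples above; details
# such as defaults are illustrative assumptions.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="pyani.py")
    subparsers = parser.add_subparsers(dest="command")

    # pyani.py download -t 931 -o my_organism
    parser_download = subparsers.add_parser("download")
    parser_download.add_argument("-t", "--taxon", required=True)
    parser_download.add_argument("-o", "--outdir", required=True)

    # pyani.py index my_organism
    parser_index = subparsers.add_parser("index")
    parser_index.add_argument("indir")

    # pyani.py anim my_organism -o my_organism_ANIm --scheduler SGE
    parser_anim = subparsers.add_parser("anim")
    parser_anim.add_argument("indir")
    parser_anim.add_argument("-o", "--outdir", required=True)
    parser_anim.add_argument("--scheduler", default="multiprocessing")

    return parser


# Example: parse one of the command lines given above
args = build_parser().parse_args(
    ["anim", "my_organism", "-o", "my_organism_ANIm", "--scheduler", "SGE"]
)
```

Each subcommand would then dispatch to its own handler function, keyed on args.command.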

Some modifications to the options are also desirable:

  • specify multiple input directories
  • specify multiple class/label files
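Accepting multiple input directories and multiple class/label files could be done with argparse's nargs="+"; the option names in this sketch are illustrative, not settled:

```python
# Sketch of accepting multiple input directories and class/label files
# via nargs="+". The option names here are assumptions.
import argparse

parser = argparse.ArgumentParser(prog="pyani.py")
parser.add_argument("-i", "--indirs", nargs="+",
                    help="one or more input directories")
parser.add_argument("--classes", nargs="+",
                    help="one or more class files")
parser.add_argument("--labels", nargs="+",
                    help="one or more label files")

# Example: two input directories and two class files in one invocation
args = parser.parse_args(["-i", "dir_a", "dir_b",
                          "--classes", "c1.tab", "c2.tab"])
```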

Database Storage

The goal is to store all comparison results in a persistent database, so that incremental additions to existing analyses become easier and partially complete jobs can be resumed.

General Implementation

  • A specific sqlite3 database is designated as 'current' for any analysis (e.g. with pyani.py db --setdb <location>)
  • The default database location could be .pyani/pyanidb in the root directory for the analysis (other configuration/debug information may go into .pyani)
  • The database will recognise an MD5 (or other) hash as representing a unique input genome. This will require all input genomes to be 'indexed'
    • The indexing may be performed during download with pyani download
    • Indexing may be forced with pyani index <directory>
  • Previously-seen genomes will be stored in a table (a separate table of their previously-seen locations will be kept)
  • Comparisons between genome pairs will be recorded in a table, indicating the tool (MUMmer, BLAST+, etc.) and date (which may be used to force a recomparison if requested)
  • For each comparison, we will record in another table the values that are currently recorded in the output .tab files
  • Anticipated tables:
    • genomes: hashes of genome sequences
    • paths: known paths for each hash, keyed by hash from genomes
    • comparisons: pairwise comparisons conducted, keyed jointly by the query and subject genomes from genomes, with a column describing the comparison (and options used)
    • data: pairwise comparison results: identity, coverage, mismatches, etc. - what is currently reported in .tab files
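The anticipated tables could be sketched as an sqlite3 schema along the following lines; the column names and types are assumptions based on the notes above, not a finalised design:

```python
# Sketch of the anticipated sqlite3 schema. Table names follow the notes
# above; column names and types are illustrative assumptions.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS genomes (
    hash TEXT PRIMARY KEY,       -- MD5 (or other) hash of the genome sequence
    description TEXT
);
CREATE TABLE IF NOT EXISTS paths (
    hash TEXT REFERENCES genomes(hash),
    path TEXT,                   -- a location at which this genome has been seen
    PRIMARY KEY (hash, path)
);
CREATE TABLE IF NOT EXISTS comparisons (
    query_hash TEXT REFERENCES genomes(hash),
    subject_hash TEXT REFERENCES genomes(hash),
    tool TEXT,                   -- e.g. MUMmer, BLAST+
    options TEXT,                -- options used for the comparison
    run_date TEXT,               -- may be used to force a recomparison
    PRIMARY KEY (query_hash, subject_hash, tool, options)
);
CREATE TABLE IF NOT EXISTS data (
    query_hash TEXT,             -- refers to a row in comparisons
    subject_hash TEXT,
    identity REAL,               -- values currently reported in .tab files
    coverage REAL,
    mismatches INTEGER
);
"""

conn = sqlite3.connect(":memory:")  # in practice, e.g. .pyani/pyanidb
conn.executescript(SCHEMA)
```

Checking whether a comparison has already been run then becomes a single SELECT against comparisons, keyed on the two genome hashes and the tool/options used.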

This database will allow rapid identification of analyses that have already been performed, avoiding the need to redo comparisons.

It will also provide a persistent record of comparisons that can be accessed for downstream analyses using, e.g., pyani render and a set of genome files (or a list of their hashes?). This will allow ready subsetting of outputs.

Tables