Bailey Harrington edited this page Apr 29, 2022 · 12 revisions

Current roadmap (2022)

Milestone 0.3.0

  • new command-line API
  • functionality similar to v0.2.x

Subcommands

Issue  PR   Description
123    338  ANIb(lastall)
124    --   TETRA
150    --   Classify
378    387  Alembic
--     364  Compare

Features

Issue  PR  Description
146    --  Config file
215    --  SLURM support
147    --  Pipe 3rd party output to temp location

Questions

Issue  PR  Description
151    --  ANIm metric status

Bugs

Issue  PR   Description
373    376  ANIm should not be symmetric

Misc.

Issue  PR  Description
145    --  Warnings for 0-identity comparisons
188    --  Propagate labels for taxon determination
392    --  Rationalise documentation
152    --  Update logging exceptions
248    --  Make v0.2 default branch; rename master to v0.3
194    --  Adopt concurrent.futures in place of multiprocessing

Close?

Issue  PR  Description
221    --  Missing labels and captions in plots with default settings
129    --  ANIm: check class/label files before loading sequences

Milestone 0.3.1

  • Extension of pyani v0.3.0 to add new functionality and outputs

Subcommands

Issue  PR   Description
187    370  Tree
180    --   Evolve
135    --   Subsample

Features

Issue  PR  Description
136    --  Use JSON for labels/classes files
116    --  Order rows and columns in clustering order like images
94     --  Fetching only N genomes
343    --  --dry-run flag

Bugs

Issue  PR  Description
14     --  Collating results is slow for large datasets (>1500 genomes)

Milestone 0.3.2

  • Extension of pyani v0.3.1 to accommodate alternative measures of similarity

Subcommands

Issue  PR  Description
156    --  wANI
155    --  gANI
137    --  mash
16     --  AAI

Milestone 0.3.3

  • Flask interface onto pyani database.

Features

Issue  PR  Description
148    --  Flask interface onto SQLite3 backend

Previous roadmap (2017)

This page contains notes on the planned future development of pyani.

Interface

Currently, pyani is run by calling either the average_nucleotide_identity.py or the genbank_get_genomes_by_taxon.py script with a combination of arguments. The average_nucleotide_identity.py script in particular has arguments that either perform a stage of the overall analysis or prevent a stage from executing. I would like to change this interface to a pyani.py COMMAND OPTIONS structure, similar to git and other tools.

More specifically, I would like to enable operations such as:

  • pyani.py download -t 931 -o my_organism: download all NCBI assemblies under taxon 931 to the directory my_organism
  • pyani.py index my_organism: generate MD5 or other hashes for each genome in the directory my_organism
  • pyani.py anim my_organism -o my_organism_ANIm --scheduler SGE: conduct ANIm analysis on the genomes in the directory my_organism
  • pyani.py anib my_organism -o my_organism_ANIb --scheduler SGE: conduct ANIb analysis on the genomes in the directory my_organism
  • pyani.py render my_organism_ANIm --gmethod seaborn: draw graphical output for the ANIm analysis in the directory my_organism_ANIm
  • pyani.py classify my_organism_ANIm: conduct classification analysis of ANIm results in the directory my_organism_ANIm
  • pyani.py db --setdb my_db: specify the database (sqlite3?) to hold comparison data; create it if it does not exist
  • pyani.py db --update my_organism_ANIm: update the current comparison data database with the results contained in my_organism_ANIm - this might be useful after a partial run/failure.
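
A COMMAND OPTIONS interface of this kind could be sketched with the standard library's argparse subparsers. This is only an illustration of the structure, not the planned implementation: the function name build_parser and the dest names (taxon, outdir, indir) are assumptions, and only the subcommands and flags listed above are included.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a git-style parser: pyani.py COMMAND OPTIONS (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pyani.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # download: fetch all NCBI assemblies under a taxon ID into a directory
    download = subparsers.add_parser("download")
    download.add_argument("-t", dest="taxon", required=True)
    download.add_argument("-o", dest="outdir", required=True)

    # index: generate hashes for each genome in a directory
    index = subparsers.add_parser("index")
    index.add_argument("indir")

    # anim / anib: run an analysis on a directory, optionally via a scheduler
    for name in ("anim", "anib"):
        analysis = subparsers.add_parser(name)
        analysis.add_argument("indir")
        analysis.add_argument("-o", dest="outdir", required=True)
        analysis.add_argument("--scheduler", default=None)

    # render: draw graphical output for a completed analysis
    render = subparsers.add_parser("render")
    render.add_argument("indir")
    render.add_argument("--gmethod", default=None)

    # classify: classification analysis of existing results
    classify = subparsers.add_parser("classify")
    classify.add_argument("indir")

    # db: choose the current database, or update it from a results directory
    db = subparsers.add_parser("db")
    db.add_argument("--setdb", default=None)
    db.add_argument("--update", default=None)

    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Each subcommand gets its own parser, so options such as --scheduler apply only where they make sense, and a new subcommand can be added without touching the existing ones.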

Some modifications to the options are also desirable:

  • specify multiple input directories
  • specify multiple class/label files

Database Storage

My goal is to store all comparison results in a persistent database, so that incremental additions to existing analyses are easier and partially complete jobs can be resumed.

General Implementation

  • A specific sqlite3 database is designated as 'current' for any analysis (e.g. with pyani.py db --setdb <location>)
  • The default database location could be .pyani/pyanidb in the root directory for the analysis (other configuration/debug information may go into .pyani)
  • The database will recognise an MD5 (or other) hash as representing a unique input genome. This will require all input genomes to be 'indexed'
    • The indexing may be performed during download with pyani download
    • Indexing may be forced with pyani index <directory>
  • Previously-seen genomes will be stored in a table (a separate table of their previously-seen locations will be kept)
  • Comparisons between genome pairs will be recorded in a table, indicating the tool (MUMmer, BLAST+, etc.) and date (which may be used to force a recomparison if requested)
  • For each comparison, we will record in another table the values that are currently recorded in the output .tab files
  • Anticipated tables:
    • genomes: hashes of genome sequences
    • paths: known paths for each hash, keyed by hash from genomes
    • comparisons: pairwise comparisons conducted, multikeyed by query and subject genomes from genomes, with a column describing the comparison (and options used)
    • data: pairwise comparison results: identity, coverage, mismatches, etc. - what is currently reported in .tab files
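
As a rough sketch of how the anticipated tables and the indexing step might fit together, the following uses Python's sqlite3 and hashlib modules. The four table names come from the list above; every column name, the helper names md5_of_file and index_genome, and the choice to hash the raw file bytes are assumptions made for illustration.

```python
import hashlib
import sqlite3
from pathlib import Path

# Table names follow the list above; column names are assumed for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS genomes (
    hash TEXT PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS paths (
    hash TEXT NOT NULL REFERENCES genomes(hash),
    path TEXT NOT NULL,
    UNIQUE (hash, path)
);
CREATE TABLE IF NOT EXISTS comparisons (
    comparison_id INTEGER PRIMARY KEY,
    query_hash TEXT NOT NULL REFERENCES genomes(hash),
    subject_hash TEXT NOT NULL REFERENCES genomes(hash),
    tool TEXT NOT NULL,
    options TEXT NOT NULL DEFAULT '',
    run_date TEXT NOT NULL,
    UNIQUE (query_hash, subject_hash, tool, options)
);
CREATE TABLE IF NOT EXISTS data (
    comparison_id INTEGER PRIMARY KEY REFERENCES comparisons(comparison_id),
    identity REAL,
    coverage REAL,
    mismatches INTEGER
);
"""


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_genome(conn: sqlite3.Connection, path: Path) -> str:
    """Record a genome's hash and the path it was seen at; return the hash."""
    genome_hash = md5_of_file(path)
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT OR IGNORE INTO genomes (hash) VALUES (?)", (genome_hash,))
        conn.execute(
            "INSERT OR IGNORE INTO paths (hash, path) VALUES (?, ?)",
            (genome_hash, str(path)),
        )
    return genome_hash


# An in-memory database stands in for .pyani/pyanidb here.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)
```

Because the hash is taken over the file's bytes in this sketch, two copies of the same file at different locations share one genomes row and get two paths rows, while the same sequence with different line wrapping would be treated as a new genome.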

This database will allow rapid identification of which analyses have been performed before, negating the need to redo comparisons.
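
One way this lookup might work is to subtract the pairs already recorded from the full set of pairs for the input genomes. The sketch below assumes a simplified comparisons table with query_hash, subject_hash and tool columns, and a hypothetical helper named pending_comparisons. It treats (query, subject) as an ordered pair, which is also an assumption.

```python
import sqlite3
from itertools import permutations


def pending_comparisons(conn, hashes, tool):
    """Return ordered (query, subject) hash pairs with no recorded run for tool."""
    done = set(
        conn.execute(
            "SELECT query_hash, subject_hash FROM comparisons WHERE tool = ?",
            (tool,),
        )
    )
    return [pair for pair in permutations(sorted(hashes), 2) if pair not in done]


# Minimal stand-in for the comparisons table (column names are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comparisons (query_hash TEXT, subject_hash TEXT, tool TEXT)"
)
conn.execute("INSERT INTO comparisons VALUES ('aaa', 'bbb', 'MUMmer')")

todo = pending_comparisons(conn, {"aaa", "bbb", "ccc"}, "MUMmer")
```

Only the pairs in todo would need to be submitted to the scheduler; a forced recomparison could skip the lookup or filter on the recorded date instead.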

It will also provide a persistent record of comparisons which can be accessed for downstream analyses using, e.g. pyani render and a set of genome files (or list of their hashes?). This will allow ready subsetting of outputs.
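
Subsetting by a list of hashes could look roughly like the query below. For brevity it assumes a flattened results table that holds query_hash, subject_hash and identity together, rather than the separate comparisons and data tables described above; the function name subset_results is likewise hypothetical.

```python
import sqlite3


def subset_results(conn, hashes):
    """Return stored results where both genomes are in the requested set."""
    wanted = sorted(hashes)
    if not wanted:
        return []
    # One placeholder per hash; the list is bound twice (query and subject).
    marks = ",".join("?" * len(wanted))
    query = (
        "SELECT query_hash, subject_hash, identity FROM results "
        f"WHERE query_hash IN ({marks}) AND subject_hash IN ({marks})"
    )
    return conn.execute(query, wanted + wanted).fetchall()


# Toy data: three genomes, four stored comparisons (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (query_hash TEXT, subject_hash TEXT, identity REAL)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("aaa", "bbb", 0.5),
        ("bbb", "aaa", 0.25),
        ("aaa", "ccc", 0.75),
        ("ccc", "bbb", 0.125),
    ],
)

rows = subset_results(conn, {"aaa", "bbb"})
```

A rendering step could then build its matrices from rows alone, without rerunning any comparison.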

Tables