
Current roadmap (2022)

Milestone 0.2.12

  • updates to the version 0.2 distribution on PyPI and bioconda

Bugs

| Issue | Branch | PR | Description |
|-------|--------|----|-------------|
| 248   |        |    | Make version_0_2 the default branch; rename master |
| 396   |        |    | Checklist of steps for this update |

Milestone 0.3.0

  • new command-line API
  • functionality similar to v0.2.x

Subcommands

| Issue | Branch      | PR  | Description |
|-------|-------------|-----|-------------|
| 123   | anib_123    | 338 | ANIb(lastall) |
| 124   |             |     | TETRA |
| 150   |             |     | Classify |
| 378   | alembic_378 | 387 | Alembic |
| --    | compare     | 364 | Compare |

Features

| Issue | Branch | PR | Description |
|-------|--------|----|-------------|
| 146   |        |    | Config file |
| 215   |        |    | SLURM support |
| 147   |        |    | Pipe 3rd party output to temp location |

Questions

| Issue | Branch | PR | Description |
|-------|--------|----|-------------|
| 151   |        |    | ANIm metric status |

Bugs

| Issue | Branch                  | PR  | Description |
|-------|-------------------------|-----|-------------|
| 373   | issue_373               | 376 | ANIm should not be symmetric |
| 383   | issue_383               | 385 | try/except around extraction in pyani download |
| 371   |                         |     | ValueError: zero-size array to reduction operation minimum which has no identity |
| 342   | issue_342, noextend_342 |     | Use --noextend in NUCmer as a rule |
| 340   |                         |     | Alignment coverage >1.0 |

Misc.

| Issue | Branch | PR | Description |
|-------|--------|----|-------------|
| 145   |        |    | Warnings for 0-identity comparisons |
| 188   |        |    | Propagate labels for taxon determination |
| 392   |        |    | Rationalise documentation |
| 152   |        |    | Update logging exceptions |
| 194   |        |    | Adopt concurrent.futures in place of multiprocessing |

Close?

| Issue | Branch | PR | Description |
|-------|--------|----|-------------|
| 221   |        |    | Missing labels and captions in plots with default settings |
| 129   |        |    | ANIm: check class/label files before loading sequences |

Milestone 0.3.1

  • Extension of pyani v0.3.0 to add new functionality and outputs

Subcommands

| Issue | Branch   | PR  | Description |
|-------|----------|-----|-------------|
| 187   | tree_186 | 370 | Tree (branch named for a now-closed issue) |
| 180   |          |     | Evolve |
| 135   |          |     | Subsample |
| 362   |          |     | Add tests for --recovery mode |

Features

| Issue | Branch | PR | Description |
|-------|--------|----|-------------|
| 136   |        |    | Use JSON for labels/classes files |
| 116   |        |    | Order rows and columns in clustering order like images |
| 94    |        |    | Fetching only N genomes |
| 343   |        |    | --dry-run flag |

Bugs

| Issue | Branch | PR | Description |
|-------|--------|----|-------------|
| 14    |        |    | Collating results is slow for large datasets (>1500 genomes) |
| 306   |        |    | NUCmer job generation for large jobs slows down rapidly |

Milestone 0.3.2

  • Extension of pyani v0.3.1 to accommodate alternative measures of similarity

Subcommands

| Issue | Branch | PR | Description |
|-------|--------|----|-------------|
| 156   |        |    | wANI |
| 155   |        |    | gANI |
| 137   |        |    | mash |
| 16    |        |    | AAI |

Milestone 0.3.3

  • Flask interface onto pyani database.

Features

| Issue | Branch | PR | Description |
|-------|--------|----|-------------|
| 148   |        |    | Flask interface onto SQLite3 backend |

Previous roadmap (2017)

This page contains notes on the planned future development of pyani.

Interface

The current interface to the pyani scripts is to call either average_nucleotide_identity.py or genbank_get_genomes_by_taxon.py with a combination of arguments. For average_nucleotide_identity.py in particular, some arguments perform a stage of the overall analysis while others prevent a stage from executing. I would like to change this interface to a pyani.py COMMAND OPTIONS structure, similar to git and other tools.

More specifically, I would like to enable operations such as the following (a sketch of one possible command-line structure is given after the list):

  • pyani.py download -t 931 -o my_organism: download all NCBI assemblies under taxon 931 to the directory my_organism
  • pyani.py index my_organism: generate MD5 or other hashes for each genome in the directory my_organism
  • pyani.py anim my_organism -o my_organism_ANIm --scheduler SGE: conduct ANIm analysis on the genomes in the directory my_organism
  • pyani.py anib my_organism -o my_organism_ANIb --scheduler SGE: conduct ANIb analysis on the genomes in the directory my_organism
  • pyani.py render my_organism_ANIm --gmethod seaborn: draw graphical output for the ANIm analysis in the directory my_organism_ANIm
  • pyani.py classify my_organism_ANIm: conduct classification analysis of ANIm results in the directory my_organism_ANIm
  • pyani.py db --setdb my_db: specify the database (sqlite3?) to hold comparison data; create it if it does not exist
  • pyani.py db --update my_organism_ANIm: update the current comparison data database with the results contained in my_organism_ANIm - this might be useful after a partial run/failure.
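As a rough illustration of that COMMAND OPTIONS structure, the sketch below builds a handful of the subcommands above with argparse; the function names, option names, and defaults are assumptions for illustration only, not a settled design.

```python
# pyani.py -- sketch of a git-style subcommand interface (illustrative only)
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="pyani.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # pyani.py download -t 931 -o my_organism
    p_download = subparsers.add_parser("download", help="download NCBI assemblies under a taxon")
    p_download.add_argument("-t", "--taxon", required=True, help="NCBI taxonomy ID")
    p_download.add_argument("-o", "--outdir", required=True, help="directory to write genomes to")

    # pyani.py index my_organism
    p_index = subparsers.add_parser("index", help="generate MD5 (or other) hashes for each genome")
    p_index.add_argument("indir", help="directory of genome FASTA files")

    # pyani.py anim my_organism -o my_organism_ANIm --scheduler SGE
    p_anim = subparsers.add_parser("anim", help="conduct ANIm analysis")
    p_anim.add_argument("indir", help="directory of genome FASTA files")
    p_anim.add_argument("-o", "--outdir", required=True, help="directory for ANIm output")
    p_anim.add_argument("--scheduler", choices=["multiprocessing", "SGE"],
                        default="multiprocessing", help="job scheduler to use")

    # anib, render, classify and db would be added in the same way
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # in practice, dispatch to the function handling args.command
```

Each subparser could carry its own handler via set_defaults(func=...), which would keep the download, index, analysis, and rendering stages as cleanly separated as the list above implies.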

Some modifications to the options are also desirable:

  • specify multiple input directories
  • specify multiple class/label files

Database Storage

I have a goal of storing all the comparison results in a persistent database, so that incremental additions to existing analyses are easier and partially complete jobs can be resumed.

General Implementation

  • A specific sqlite3 database is designated as 'current' for any analysis (e.g. with pyani.py db --setdb <location>)
  • The default database location could be .pyani/pyanidb in the root directory for the analysis (other configuration/debug information may go into .pyani)
  • The database will recognise an MD5 (or other) hash as representing a unique input genome. This will require all input genomes to be 'indexed':
    • The indexing may be performed during download with pyani download
    • Indexing may be forced with pyani index <directory>
  • Previously-seen genomes will be stored in a table (a separate table of their previously-seen locations will be kept)
  • Comparisons between genome pairs will be recorded in a table, indicating the tool (MUMmer, BLAST+, etc.) and date (which may be used to force a recomparison if requested)
  • For each comparison, we will record in another table the values that are currently recorded in the output .tab files
  • Anticipated tables (a schema sketch is given after this list):
    • genomes: hashes of genome sequences
    • paths: known paths for each hash, keyed by hash from genomes
    • comparisons: pairwise comparisons conducted, multikeyed by query and subject genomes from genomes, with a column describing the comparison (and options used)
    • data: pairwise comparison results: identity, coverage, mismatches, etc. - what is currently reported in .tab files
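A minimal sketch of what those four tables might look like as an SQLite schema, together with the MD5 'indexing' step, is shown below. The column names, and the use of Python's sqlite3 and hashlib modules, are assumptions for illustration rather than the eventual pyani schema.

```python
# Illustrative only: one possible SQLite layout for the tables listed above.
import hashlib
import os
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS genomes (
    hash TEXT PRIMARY KEY               -- MD5 of the genome sequence
);
CREATE TABLE IF NOT EXISTS paths (
    hash TEXT REFERENCES genomes(hash),
    path TEXT,                          -- a location this genome has been seen at
    PRIMARY KEY (hash, path)
);
CREATE TABLE IF NOT EXISTS comparisons (
    query_hash   TEXT REFERENCES genomes(hash),
    subject_hash TEXT REFERENCES genomes(hash),
    program      TEXT,                  -- e.g. MUMmer, BLAST+
    options      TEXT,                  -- command-line options used
    run_date     TEXT,                  -- may be used to force a recomparison
    PRIMARY KEY (query_hash, subject_hash, program, options)
);
CREATE TABLE IF NOT EXISTS data (
    query_hash   TEXT,
    subject_hash TEXT,
    program      TEXT,
    identity     REAL,                  -- values currently reported in the .tab files
    coverage     REAL,
    mismatches   INTEGER
);
"""


def index_genome(path):
    """Return the MD5 hash of a genome FASTA file (the 'indexing' step)."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def init_db(dbpath=".pyani/pyanidb"):
    """Create (or open) the 'current' analysis database at the default location."""
    dirname = os.path.dirname(dbpath)
    if dirname:
        os.makedirs(dirname, exist_ok=True)
    conn = sqlite3.connect(dbpath)
    conn.executescript(SCHEMA)
    return conn
```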

This database will allow rapid identification of which analyses have been performed before, negating the need to redo comparisons.

It will also provide a persistent record of comparisons that can be accessed for downstream analyses using, e.g., pyani render and a set of genome files (or a list of their hashes?). This will allow ready subsetting of outputs.
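For example, deciding whether a pairwise comparison has already been run could then be a single lookup against the comparisons table (again sketched against the assumed schema above, not pyani's actual query layer):

```python
def comparison_exists(conn, query_hash, subject_hash, program):
    """Return True if this pairwise comparison is already recorded in the database."""
    cursor = conn.execute(
        "SELECT 1 FROM comparisons "
        "WHERE query_hash = ? AND subject_hash = ? AND program = ?",
        (query_hash, subject_hash, program),
    )
    return cursor.fetchone() is not None
```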

Tables