Roadmap

Current roadmap (2022)

Milestone 0.3.0

new command-line API
functionality similar to v0.2.x

Subcommands

Issue	PR
123	338	ANIb(lastall)
124		TETRA
150		Classify
378	387	Alembic
--	364	Compare

Features

Issue	PR
146		Config file
215		SLURM support
147		Pipe 3rd party output to temp location

Questions

Issue	PR
151		ANIm metric status

Bugs

Issue	PR
373	376	ANIm should not be symmetric

Misc.

Issue	PR
145		Warnings for 0-identity comparisons
188		Propagate labels for taxon determination
392		Rationalise documentation
152		Update logging exceptions
248		Make v0.2 default branch; rename master to v0.3
194		Adopt concurrent.futures in place of multiprocessing

Close?

Issue	PR
221		Missing labels and captions in plots with default settings
129		ANIm: check class/label files before loading sequences

Milestone 0.3.1

Extension of pyani v0.3.0 to add new functionality and outputs

Subcommands

Issue	PR
187	370	Tree
180		Evolve
135		Subsample

Features

Issue	PR
136		Use JSON for labels/classes files
116		Order rows and columns in clustering order like images
94		Fetching only N genomes
343		--dry-run flag

Bugs

Issue	PR
14		Collating results is slow for large datasets (>1500 genomes)

Milestone 0.3.2

Extension of pyani v0.3.1 to accommodate alternative measures of similarity

Subcommands

Issue	PR
156		wANI
155		gANI
137		mash
16		AAI

Milestone 0.3.3

Flask interface onto pyani database.

Features

Issue	PR
148		Flask interface onto SQLite3 backend

Previous roadmap (2017)

This page contains notes on the planned future development of pyani.

Interface

The current interface for pyani scripts is to call either the average_nucleotide_identity.py or genbank_get_genomes_by_taxon.py scripts with a combination of arguments. For the average_nucleotide_identity.py script in particular there are arguments that either perform a stage in the total analysis, or prevent a stage from executing. I would like to change this interface to a pyani.py COMMAND OPTIONS structure, similar to git and other tools.

More specifically, I would like to enable operations such as:

pyani.py download -t 931 -o my_organism: download all NCBI assemblies under taxon 931 to the directory my_organism
pyani.py index my_organism: generate MD5 or other hashes for each genome in the directory my_organism
pyani.py anim my_organism -o my_organism_ANIm --scheduler SGE: conduct ANIm analysis on the genomes in the directory my_organism
pyani.py anib my_organism -o my_organism_ANIb --scheduler SGE: conduct ANIb analysis on the genomes in the directory my_organism
pyani.py render my_organism_ANIm --gmethod seaborn: draw graphical output for the ANIm analysis in the directory my_organism_ANIm
pyani.py classify my_organism_ANIm: conduct classification analysis of ANIm results in the directory my_organism_ANIm
pyani.py db --setdb my_db: specify the database (sqlite3?) to hold comparison data; create it if it does not exist
pyani.py db --update my_organism_ANIm: update the current comparison data database with the results contained in my_organism_ANIm - this might be useful after a partial run/failure.
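A minimal sketch of how the `pyani.py COMMAND OPTIONS` structure above could be parsed with Python's standard `argparse` subparsers. Only the command names and flags listed above come from this page; the `dest` names (`taxon`, `indir`, `outdir`), the choice of required versus optional arguments, and the `build_parser` helper are assumptions made for illustration, not the actual pyani implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a git-style `pyani.py COMMAND OPTIONS` parser (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pyani.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # download: fetch all NCBI assemblies under a taxon ID
    download = subparsers.add_parser("download")
    download.add_argument("-t", dest="taxon", required=True)
    download.add_argument("-o", dest="outdir", required=True)

    # index: generate a hash for each genome in a directory
    index = subparsers.add_parser("index")
    index.add_argument("indir")

    # anim / anib: run an analysis, optionally through a scheduler
    for name in ("anim", "anib"):
        analysis = subparsers.add_parser(name)
        analysis.add_argument("indir")
        analysis.add_argument("-o", dest="outdir", required=True)
        analysis.add_argument("--scheduler", default=None)

    # render: draw graphical output for an analysis directory
    render = subparsers.add_parser("render")
    render.add_argument("indir")
    render.add_argument("--gmethod", default=None)

    # classify: run classification on an analysis directory
    classify = subparsers.add_parser("classify")
    classify.add_argument("indir")

    # db: choose the current database or update it from a results directory
    db = subparsers.add_parser("db")
    db.add_argument("--setdb", default=None)
    db.add_argument("--update", default=None)

    return parser


# Parse an explicit argument list rather than sys.argv, for demonstration
args = build_parser().parse_args(["download", "-t", "931", "-o", "my_organism"])
print(args.command, args.taxon, args.outdir)
```

Each subcommand gets its own parser, so options such as `--scheduler` only apply where they make sense, and `args.command` can be used to dispatch to the matching function.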

Some modifications to the options are also desirable:

specify multiple input directories
specify multiple class/label files
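One possible way to accept several input directories and several class/label files, again sketched with `argparse`. The option names `--classes` and `--labels` and the file names are hypothetical; the point is only the contrast between `nargs="+"` (several values after one argument) and `action="append"` (a flag that may be repeated).

```python
import argparse

parser = argparse.ArgumentParser(prog="pyani.py anim")
# nargs="+" collects one or more positional values into a list
parser.add_argument("indirs", nargs="+", help="one or more input directories")
# action="append" lets the flag be given repeatedly, once per file
parser.add_argument("--classes", action="append", default=[])
parser.add_argument("--labels", action="append", default=[])

multi_args = parser.parse_args(
    ["dir_a", "dir_b", "--classes", "a_classes.txt", "--classes", "b_classes.txt"]
)
print(multi_args.indirs)   # ['dir_a', 'dir_b']
print(multi_args.classes)  # ['a_classes.txt', 'b_classes.txt']
```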

Database Storage

I aim to store all comparison results in a persistent database, so that incremental additions to existing analyses are easier and partially complete jobs can be resumed.

General Implementation

A specific sqlite3 database is designated as 'current' for any analysis (e.g. with pyani.py db --setdb <location>)
The default database location could be .pyani/pyanidb in the root directory for the analysis (other configuration/debug information may go into .pyani)
The database will recognise an MD5 (or other) hash as representing a unique input genome. This will require all input genomes to be 'indexed'.
- The indexing may be performed during download with pyani download
- Indexing may be forced with pyani index <directory>
Previously-seen genomes will be stored in a table (a separate table of their previously-seen locations will be kept)
Comparisons between genome pairs will be recorded in a table, indicating the tool (MUMmer, BLAST+, etc.) and date (which may be used to force a recomparison if requested)
For each comparison, we will record in another table the values that are currently recorded in the output .tab files
Anticipated tables:
- genomes: hashes of genome sequences
- paths: known paths for each hash, keyed by hash from genomes
- comparisons: pairwise comparisons conducted, multikeyed by query and subject genomes from genomes, with a column describing the comparison (and options used)
- data: pairwise comparison results: identity, coverage, mismatches, etc. - what is currently reported in .tab files
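The anticipated tables and the indexing step could be sketched with Python's built-in `sqlite3` and `hashlib` modules as below. The table names come from the list above; every column name, the integer `comparison_id` key, the uniqueness constraints, and the helpers `md5_of_file` and `index_genome` are assumptions made for illustration rather than a settled design.

```python
import hashlib
import sqlite3
from pathlib import Path

# Hypothetical schema: table names follow the list above, columns are assumed
SCHEMA = """
CREATE TABLE IF NOT EXISTS genomes (
    hash TEXT PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS paths (
    hash TEXT NOT NULL REFERENCES genomes (hash),
    path TEXT NOT NULL,
    UNIQUE (hash, path)
);
CREATE TABLE IF NOT EXISTS comparisons (
    comparison_id INTEGER PRIMARY KEY,
    query_hash    TEXT NOT NULL REFERENCES genomes (hash),
    subject_hash  TEXT NOT NULL REFERENCES genomes (hash),
    tool          TEXT NOT NULL,
    options       TEXT NOT NULL DEFAULT '',
    run_date      TEXT NOT NULL,
    UNIQUE (query_hash, subject_hash, tool, options)
);
CREATE TABLE IF NOT EXISTS data (
    comparison_id INTEGER NOT NULL REFERENCES comparisons (comparison_id),
    identity      REAL,
    coverage      REAL,
    mismatches    INTEGER
);
"""


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_genome(conn: sqlite3.Connection, path: Path) -> str:
    """Record a genome's hash and its location; re-indexing is a no-op."""
    genome_hash = md5_of_file(path)
    conn.execute("INSERT OR IGNORE INTO genomes (hash) VALUES (?)", (genome_hash,))
    conn.execute(
        "INSERT OR IGNORE INTO paths (hash, path) VALUES (?, ?)",
        (genome_hash, str(path)),
    )
    conn.commit()
    return genome_hash


# An in-memory database keeps the sketch side-effect free
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Keying `paths` on the genome hash means the same sequence found in two directories is stored once in `genomes` and twice in `paths`.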

This database will allow rapid identification of which analyses have been performed before, negating the need to redo comparisons.
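As a rough illustration of that lookup, the sketch below lists which pairwise comparisons still need to be run for a set of genome hashes. The `pending_comparisons` helper, the column names, and the reduced `comparisons` table are hypothetical. Treating (query, subject) and (subject, query) as distinct comparisons is also an assumption.

```python
import itertools
import sqlite3


def pending_comparisons(conn, hashes, tool):
    """Return ordered (query, subject) hash pairs not yet recorded for `tool`."""
    done = set(
        conn.execute(
            "SELECT query_hash, subject_hash FROM comparisons WHERE tool = ?",
            (tool,),
        )
    )
    # permutations() yields both directions of every pair
    return [pair for pair in itertools.permutations(hashes, 2) if pair not in done]


# Minimal stand-in for the comparisons table, with one comparison already stored
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE comparisons (query_hash TEXT, subject_hash TEXT, tool TEXT)"
)
demo.execute("INSERT INTO comparisons VALUES ('aaa', 'bbb', 'MUMmer')")

todo = pending_comparisons(demo, ["aaa", "bbb", "ccc"], "MUMmer")
print(len(todo))  # 5 of the 6 ordered pairs remain
```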

It will also provide a persistent record of comparisons which can be accessed for downstream analyses using, e.g. pyani render and a set of genome files (or list of their hashes?). This will allow ready subsetting of outputs.
