Writing a new submodule (subcommand)

Bailey Harrington edited this page Jul 28, 2022 · 4 revisions

Note: This page describes the parts of a pyani subcommand and the required (additions to) files in general terms first; then goes into more specifics. There is also a branch, subcmd_blueprint, that shows the additions necessary to create a mock subcommand called 'blueprint'. The relevant bits of code can be found in the files that contain blueprint in the name, or by searching for blueprint in the codebase (e.g., using grep). Additions to existing files are preceded by comments containing this word.

General overview

Parts of the pyani file structure that will be discussed are shown here:

.
│
├── docs
│   └── <doc files>
├── pyani
│   ├── pyani_graphics
│   │   └── ...
│   ├── scripts
│   │   ├── parsers
│   │   │   ├── __init__.py
│   │   │   └── <parsers>
│   │   ├── subcommands
│   │   │   ├── __init__.py
│   │   │   └── <subcommands>
│   │   ├── __init__.py
│   │   └── <other scripts>
│   ├── __init__.py
│   └── <API files>
├── scratch
│   └── ...
├── tests
│   ├── ...
│   ├── conftest.py
│   ├── README.md
│   └── <test files>
├── ...
├── README.md
├── <requirements files>
└── ...

When adding a submodule, there are several parts that need to be written and connected to the rest of the project. These can be generally thought of as:

  • the code,
  • the infrastructure,
  • the configuration,
  • the tests, and
  • the documentation.

The necessary components

A more detailed view of what constitutes each of the parts listed above is given below.

Code

These will be a group of .py files within the pyani directory that perform the actual function of the new subcommand.

  • the program for command-line execution
  • API points for interactive use
  • the necessary I/O
  • communication with the database
  • running analyses via third-party tools
  • et cetera

Infrastructure

This will consist of some new .py files and additions to existing .py files.

  • relevant parser(s)
  • help information
  • all of the code necessary to tell Python this is a submodule

Configuration

What configuration is needed depends on what the subcommand actually does, and on any third-party tools it uses. Potential file types vary, but .json, .yaml, .toml, .ini, or even .txt are likely formats for config files.

  • other files that may be necessary for the specific feature being implemented (e.g., config or .ini files)
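As a small illustration, a subcommand that reads a JSON config file might load it like this (a sketch only; the file name and key names here are invented for the blueprint example and are not part of pyani):

```python
import json

# Hypothetical configuration for the mock 'blueprint' subcommand; the key
# names are illustrative only, not part of pyani.
CONFIG_TEXT = """
{
    "executable": "blueprint_tool",
    "threads": 4,
    "outdir": "blueprint_output"
}
"""


def load_blueprint_config(text: str = CONFIG_TEXT) -> dict:
    """Parse a JSON config string, applying a simple default for missing keys."""
    config = json.loads(text)
    config.setdefault("threads", 1)  # fall back to a single thread if unset
    return config
```

The same pattern applies to .yaml, .toml, or .ini files, with the stdlib or third-party parser swapped in as appropriate.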

Tests

These will be located in a combination of new and existing .py files within the tests directory.

  • unit tests to test individual bits of code do what is expected and are consistent
  • integration tests to test submodules fit together as expected

Documentation

Documentation will be added in several places, including the Readme, within the code itself, new and existing files within the docs directory, and on the wiki.

  • include helpful comments in the code as it is written
  • any changes to the Readme (most things will not require changes here)
  • add new pages in ReadTheDocs source
  • add useful, related information for developers and users to the wiki

Workflow

Code, infrastructure, and configuration

Different people may write these elements in different orders, depending on their thought process. The suggestion put forth here is:

  • Start with the infrastructure. This makes it possible to run the submodule at different points in writing the implementation details to see the current state.
  • Sketch out the code: create placeholders for functions and objects that will be needed; write comments describing the steps the submodule needs to go through; et cetera.
  • Flesh out the code (e.g., the equivalents of the subcommand file: subcmd_fastani.py; and the API file: fastani.py), and assemble necessary configuration files. During active development, it may be easier to focus on the subcommand file alone, pulling out bits of code that might belong in the API later on into helper functions, rather than actively writing two files.

Documentation and testing

The choice of when to write documentation and testing is more open:

  • Documentation can be written before (prospectively), during, or after (retrospectively) development; however, if it is written before or during the writing of the code it documents, it will need to be audited later to ensure it matches the final version of the code.
  • Testing can similarly be written at different points, and many people advocate for writing tests first as a way of informing development (this is called Test-driven development). pyani development does not follow any specific philosophy here, but some of the CircleCI steps will annotate code that is missed by the current testing suite.

Detailed steps

Below, the steps for creating a new subcommand called blueprint are outlined in detail.

Infrastructure

  1. Create and populate a file for the parser: pyani/scripts/parsers/blueprint_parser.py.
    • use this example file as a guide so the parser is constructed how pyani expects it to be
  2. Add the new parser to pyani/scripts/parsers/__init__.py.
    • as an import
    • in the docstring for parse_cmdline()
    • into the body of parse_cmdline()
  3. Add the new subcommand into pyani/pyani_config.
    • default paths to any executables
    • any other environmental variables specific to this subcommand

Code

To test that everything in the infrastructure is working, first do these things:

  1. Create and populate pyani/scripts/subcommands/subcmd_blueprint.py.
    • add a function subcmd_blueprint() with an empty body
  2. Import this function in pyani/scripts/subcommands/__init__.py.
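A minimal sketch of these two steps follows. The real subcommand functions in pyani may take additional arguments (such as a logger), so check an existing subcommand for the exact signature:

```python
# pyani/scripts/subcommands/subcmd_blueprint.py (sketch)
from argparse import Namespace


def subcmd_blueprint(args: Namespace) -> int:
    """Run the mock blueprint subcommand (empty body for now)."""
    # The implementation comes later; returning 0 signals success to the caller.
    return 0


# pyani/scripts/subcommands/__init__.py would then gain the line:
# from .subcmd_blueprint import subcmd_blueprint
```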

Note: At this point, it should be possible to run pyani -h and see blueprint in the list of valid subcommands. It will also be possible to run pyani blueprint -h and see any options already defined in the output. Having this set up and working should help immensely when writing the actual subcommand.

Once the new subcommand shows up in the command line, start sketching out the code needed for it, and subsequently filling it out.

  1. Create a skeleton of comments outlining the steps the subcommand needs to follow. Looking at comments and logging statements in other subcommands will help here. The end goal is essentially a script that performs the analysis/action of the subcommand with no user interaction. General things that should happen often include:

    • announce that the subcommand is starting
    • check and report the version of any executables that will be used
    • generate a unique name for the analysis (not applicable to all subcommands)
    • connect to the database (applies to most subcommands)
    • add information about the analysis to the database (if applicable)
    • identify input files
    • generate command lines
    • create output directories
    • check whether some commands do not need to be run because their results are already in the database
    • create a list of jobs
    • pass jobs to the scheduler (if appropriate)
    • update the database with results (if applicable)
  2. Start fleshing out subcmd_blueprint(). It can be helpful to temporarily put something at the end, such as a print statement or a raise NotImplementedError, that makes it clear when execution has reached the end.

    • this will likely not be a linear process
    • add lots of explanatory comments throughout, documenting choices that are made, and why
    • it might become clear that some bits of code should be helper functions, or part of the API
  3. Identify specific tasks related to the subcommand and plan for them to be housed in pyani/blueprint.py, which will hold the API points. This allows users to implement the different analyses/actions provided with pyani, or individual tasks performed as part of the subcommand script, in an interactive session using a python interpreter. Examples include:

    • get_version()
    • functions that construct command lines
    • functions that parse results files and load results into the database
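The kinds of API points listed above might be sketched as follows. The tool name, flags, and version format are invented for the blueprint example; the real equivalents live in API files such as pyani's fastani.py.

```python
# pyani/blueprint.py (sketch; tool name, flags, and version format are hypothetical)
import re
from pathlib import Path


def construct_blueprint_cmdline(query: Path, subject: Path, outdir: Path) -> str:
    """Return the command line for one pairwise blueprint comparison."""
    outfile = outdir / f"{query.stem}_vs_{subject.stem}.tab"
    return f"blueprint_tool -q {query} -s {subject} -o {outfile}"


def parse_version(text: str) -> str:
    """Extract a dotted version number from a tool's --version output."""
    match = re.search(r"(\d+\.\d+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"could not parse a version from {text!r}")
    return match.group(1)
```

Keeping functions like these in the API file, rather than inline in the subcommand, is what makes them usable from an interactive interpreter session.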

Adding dependencies

In the course of adding code, it may be that new dependencies need to be added to pyani. This is perfectly fine, though it is worth considering whether the dependency is truly necessary, or whether the same thing can be accomplished with one that already exists (in the interest of keeping pyani small).

If the dependency does need to be added, it should be appended to one of the requirements*.txt files. Which file depends on the type of dependency and where it can be installed from.

  • General dependency needed for running pyani
    • Most dependencies will be listed in requirements.txt. Use this file unless one of the following points applies.
  • New third-party tool
    • If it is a new third-party tool, it will be listed in requirements-thirdparty.txt.
  • Specific to developing pyani
    • If the requirement is only needed for actively developing pyani (this includes being able to run the test suite), it belongs in requirements-dev.txt.
  • Only installable via pip
    • If the requirement cannot be installed via conda, it should be added to requirements-pip.txt.
  • Other cases
    • Edge cases sometimes arise that require different treatment. This has been the case with two dependencies thus far:
      • fastANI, for which there is a known situation in which installation fails, and including it in a file with other dependencies would result in none of them being installed in that situation; and
      • pyqt, where it must be specified differently for installation via pip and conda.

Tests

At some point, it will be necessary to start thinking about writing tests for pyani blueprint. This will probably be after at least some of the code is written and it is clearer which things need to be tested.

The goal of writing a test is to check that, given a known input, some portion of the code produces a predictable result. Note: Successfully predicting the result does not mean the test is correctly written, nor that it covers all cases. Edge cases may be imagined during the writing of tests, and subsequently accounted for; but in practice, they are just as likely to be found later on, as bugs.

pyani uses two kinds of tests: unit tests, and integration tests. For pyani blueprint, the following files may be created/modified:

  • pyani/tests/test_blueprint.py – houses unit tests, mostly for the API points
  • pyani/tests/test_subcmd_XX_blueprint.py – houses integration tests
  • pyani/tests/conftest.py – (already exists) contains tests and mocked objects that are useful for testing more than one subcommand
  • pyani/tests/fixtures/blueprint/ – a directory to hold any files required to create necessary pytest fixtures
  • pyani/tests/target_blueprint_output/ – a directory to hold a full set of results to be used as 'correct' values for integration test comparison
  • pyani/tests/test_input/blueprint/ – a directory to hold a full set of inputs for a pyani blueprint run
  • pyani/tests/test_output/subcmd_blueprint/ – a non-tracked directory to which any results can be written
  • pyani/tests/test_targets/subcmd_blueprint/ – a directory to hold any results to be used as 'correct' values for unit test comparison

Unit tests target a small portion of the code: one function, a single line, or some self-contained block. In order to write a unit test, one must:

  • identify an entry point into the code (such as a function call)
  • provide all necessary inputs for use of that entry point (generally done with a combination of real and mocked inputs)
  • identify the output (a result value, a logging statement, a raised exception) that indicates the code has behaved correctly for the inputs
  • compare the result of running the code from the entry point, with the test inputs, to the expected result
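Those four steps can be seen in a small pytest-style example; the function under test here is a hypothetical stand-in for real blueprint code:

```python
def comparison_name(query_stem: str, subject_stem: str) -> str:
    """Hypothetical function under test: label one pairwise comparison."""
    return f"{query_stem}_vs_{subject_stem}"


def test_comparison_name():
    # entry point: the function call; inputs: two known genome stems
    result = comparison_name("genomeA", "genomeB")
    # output that indicates correct behaviour for these inputs
    expected = "genomeA_vs_genomeB"
    # the comparison between the actual and expected results
    assert result == expected
```

Running pytest on a file containing this test would collect and execute test_comparison_name() automatically.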