AGR's Proteins Annotations and Variants Inspector
- Architecture
- Development principles and guidelines
- Acknowledgements
- Maintainers
The PAVI repository is a monorepository consisting of all components that make PAVI work, while each component is deployed and scaled independently, for better isolation of concerns and to ensure sufficient availability and performance without needing to oversize to handle multiple components concurrently.
PAVI is made up of the following components:
- A Web UI that enables (end-)user interaction
- An API that connects the web UI to the processing pipeline, and serves as job-manager for following up on processing and result retrieval
- Pipeline components that comprise the processing pipeline required to display the requested proteins, annotations and variants. This forms the heart of PAVI, performing all sequence retrieval, processing, alignments etc.
Each of these components has its required AWS resources defined as code through AWS CDK, in an aws_infra subdirectory.
This project is divided into subcomponents which function independently but share similar concepts in their setup.
All components have a Makefile in their subdirectory that contains all targets for code validation, dependency management, build and deployment. The subchapters below describe common concepts and make targets used for specific groups of subcomponents.
Application dependencies are defined in either the pyproject.toml file for python dependencies, or the package.json file for node.js dependencies.
Furthermore, version specifications should use wildcards or version specifiers such as the compatible release clause (~= in python, ~ in node.js) to allow for automatic upgrades to newer patch and optionally minor versions, which are expected not to cause any breakage while improving security and stability.
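For illustration, the snippet below (not part of PAVI; it assumes the packaging library, which pip itself builds on, is installed) shows which versions a compatible release specifier accepts:

```python
from packaging.specifiers import SpecifierSet

# Compatible release clause: ~=1.4.2 is equivalent to >=1.4.2, <1.5.0
spec = SpecifierSet("~=1.4.2")

print("1.4.9" in spec)  # True: newer patch versions are accepted
print("1.5.0" in spec)  # False: new minor versions are not
```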
A lock file must be used to freeze all dependencies to specific versions, and this lock file must be
committed to the repository, so that builds and runs on different environments all result in the same
product, and dependency updates can always be validated before being applied to production environments.
Flexible dependency version specifications as defined above allow for a separation between low-risk version upgrades, which are expected to pass all validations without requiring additional changes, and high-risk version upgrades, which are more likely to require code changes to make the code work with the upgraded dependency.
To update the dependency lock files to apply the latest available low-risk dependency version upgrades:
# To update all dependency lock files (within a subcomponent)
make update-deps-locks-all
For high-risk upgrades, update the version specified in the pyproject.toml and/or the package.json file, then run the above make target to update the lock file(s). Run all tests to validate that the code still works, and update it as required if not.
Low-risk updates are automatically applied on PR validation to all pull requests requesting to merge into the main branch, unless the no-deps-lock-updates label is added to the PR. High-risk upgrades are proposed regularly by dependabot by means of PRs with version update proposals, as configured in the dependabot.yml file. See the GitHub Docs for more details on the specifications for the dependabot configuration file.
To install all component dependencies (frozen versions defined in the lock file):
# To install application dependencies only
make install-deps
# To install application and test dependencies
make install-test-deps
# To install all dependencies
make install-deps-all
In addition to the general development principles and concepts described above, PAVI components written in python use the following python-specific concepts.
All python components use virtual environments to isolate the build and application dependencies from the global system python setup. Make targets to create these virtual environments can be found in the PAVI root Makefile. However, these do not need to be created manually, as they are created automatically when appropriate (when calling Make targets requiring them).
- The .venv/ directory is used as the virtual environment for application dependencies.
- The .venv-build/ directory is used as the virtual environment for dependency management requirements such as pip-tools, which are installed independently of the application dependencies.
Make targets depending on these virtual environments will and should use the binaries and libraries installed in these virtual environments, without requiring them to be activated.
The virtual environment can be activated environment-wide by running the command below, should this be needed for development or troubleshooting purposes (VSCode will activate the application .venv automatically when opening a new terminal for that directory).
source .venv/bin/activate
Once the virtual environment is activated, all python commands will automatically use the python binaries and dependencies installed in this isolated virtual environment.
To deactivate an active virtual environment, run the deactivate command.
Application dependencies are defined in the pyproject.toml file and should follow the python guidelines. This is required to ensure compatibility with external dependency managers such as dependabot.
At the time of writing, dependency specifications used by poetry 1.* do not adhere to the python guidelines (specifically, its pyproject.toml file usage is not PEP-621 compatible), which makes it incompatible with dependabot (which can consequently only update the poetry.lock file, not the dependency versions specified in the pyproject.toml file). Due to this, the decision was made not to use poetry, at least until it becomes PEP-621 compatible (which is expected from the poetry 2.* release).
As an alternative, all PAVI python components currently use pip-tools for dependency management. This is done by converting the flexible dependency specifiers from the pyproject.toml file to frozen versions in requirements.txt files. As a consequence, dependabot cannot distinguish project dependencies from subdependencies (something that is possible through poetry) and will propose updates for subdependencies where that may not be appropriate. Such subdependencies must be added to the relevant ignore: sections in the dependabot.yml configuration file to disable such update proposals.
All modules, functions, classes and methods should have their input, attributes and output documented through docstrings to make the code easy to read and understand for anyone reading it. To ensure this is done in a uniform way across all code, follow the Google Python Style Guide on docstrings.
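As an illustration, a function documented in that style could look as follows (a hypothetical example, not taken from the PAVI codebase):

```python
def get_subsequence(sequence: str, start: int, end: int) -> str:
    """Extract a subsequence from a sequence string.

    Args:
        sequence: The full sequence to extract from.
        start: Start position of the subsequence (0-based, inclusive).
        end: End position of the subsequence (0-based, exclusive).

    Returns:
        The requested subsequence.

    Raises:
        ValueError: If the requested region falls outside the sequence boundaries.
    """
    if start < 0 or end > len(sequence):
        raise ValueError("Requested subsequence exceeds sequence boundaries.")
    return sequence[start:end]
```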
TL;DR
Use Python type hints wherever possible; mypy is used to enforce this on PR validation.
To check all code typing in a PAVI submodule, run:
make run-type-checks
Detailed explanation
While Python uses dynamic typing at runtime, it is recommended to use type hints to declare intended types. These can be used by IDEs and type checkers to provide code completion, usage hints and warnings where incompatible types are used. This provides a way to catch more potential bugs during development, before they arise in deployed environments and require tedious troubleshooting.
As an example, the following untyped python function defines local cache usage behavior, taking a boolean as toggle:
def set_local_cache_reuse(reuse):
    """
    Define whether or not data_file_mover will reuse files in local cache where already available pre-execution.
    Args:
        reuse (bool): set to `True` to enable local cache reuse behavior (default `False`)
    """
    global _reuse_local_cache
    _reuse_local_cache = reuse
    if reuse:
        print("Local cache reuse enabled.")
    else:
        print("Local cache reuse disabled.")
When called with boolean values, this function works just fine:
>>> set_local_cache_reuse(True)
Local cache reuse enabled.
>>> set_local_cache_reuse(False)
Local cache reuse disabled.
However, when calling it with a String instead of a boolean, you may get unexpected behaviour:
>>> set_local_cache_reuse("False")
Local cache reuse enabled.
This happens because Python dynamically types and converts values at runtime, and all strings except the empty string evaluate to the boolean value True.
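This truthiness behaviour is easy to verify directly in a Python shell:
>>> bool("False")
True
>>> bool("")
False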
To prevent this, add type hints to your code:
def set_local_cache_reuse(reuse: bool) -> None:
    """
    Define whether or not data_file_mover will reuse files in local cache where already available pre-execution.
    Args:
        reuse (bool): set to `True` to enable local cache reuse behavior (default `False`)
    """
    global _reuse_local_cache
    _reuse_local_cache = reuse
    if reuse:
        print("Local cache reuse enabled.")
    else:
        print("Local cache reuse disabled.")

set_local_cache_reuse("False")
Type hints themselves are not enforced at runtime, and will thus not stop the code from running (incorrectly), but using mypy those errors can be revealed before merging this code. Storing the above code snippet in a file called set_local_cache_reuse.py and running mypy on it gives the following result:
> mypy set_local_cache_reuse.py
set_local_cache_reuse.py:9: error: Name "_reuse_local_cache" is not defined [name-defined]
set_local_cache_reuse.py:14: error: Argument 1 to "set_local_cache_reuse" has incompatible type "str"; expected "bool" [arg-type]
Found 2 errors in 1 file (checked 1 source file)
With the mypy output, we can now return to the code and fix the reported errors, which would otherwise result in silent unexpected behavior and bugs.
To prevent this sort of unexpected bug, all PAVI subcomponents must use type hints wherever possible. mypy is used as type checker, and is run on all python subcomponents on every PR validation to ensure good code quality.
To run type checks:
make run-type-checks
To ensure consistent code styling is used across components, flake8 is used as linter in all python components.
These style checks are enforced through PR validation, where they need to pass before enabling PR merge.
To run style checks:
make run-style-checks
Unit and integration testing for python components is done through Pytest, and all unit and integration tests must pass before PRs can be approved and merged. A minimum of 80% code coverage is required to ensure new code gets appropriate unit testing before getting merged, which ensures the code is functional and won't break unnoticed in future development iterations.
To run unit testing as a developer (generating an inspectable HTML report):
make run-unit-tests-dev
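As an illustration, a minimal pytest unit test for the set_local_cache_reuse function from the type-checking example above could look as follows (the module path in the import is hypothetical):

```python
import pytest

from data_file_mover import set_local_cache_reuse  # hypothetical module path


def test_set_local_cache_reuse_enabled(capsys: pytest.CaptureFixture) -> None:
    set_local_cache_reuse(True)
    # The function reports the toggle on stdout; capsys captures it for assertion.
    assert "Local cache reuse enabled." in capsys.readouterr().out
```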
By default, npm (the default Node.js package manager) downloads dependencies into a local node_modules subdirectory, and we make use of this feature to isolate dependencies independently for each of the PAVI components. Dependency management for node dependencies is done using npm.
Application dependencies are defined in the package.json file (with flexible version specifications), and frozen in the package-lock.json file.
As Javascript uses dynamic typing at runtime and does not support native type hints, TypeScript is used as development language instead, which is transpiled to javascript on build and deployment. Using Typescript over plain Javascript adds support for code completion, usage hints and warnings on usage of incompatible types by IDEs and type checkers, providing a way to catch more potential bugs during development, before they arise in deployed environments and require tedious troubleshooting. Therefore, all PAVI subcomponents requiring javascript code should use Typescript.
The typescript compiler tsc is used as type checker, and is run on all Typescript subcomponents on every PR, to ensure good code quality. As Typescript is Javascript code with additional syntax for types, this Typescript code is easy to read and write for any Javascript developer.
To run type checks:
make run-type-checks
To ensure consistent code styling is used across components, eslint is used as linter in all javascript/typescript components.
These style checks are enforced through PR validation, where they need to pass before enabling PR merge.
To run style checks:
make run-style-checks
Unit and integration testing for javascript/react components is done through jest, and all unit and integration tests must pass before PRs can be approved and merged.
To run unit testing as a developer:
make run-unit-tests
All PAVI components are deployed to AWS, and deployment of those components usually depends on certain AWS resources, such as ECR registries to upload the container images to, Elastic Beanstalk applications to deploy the application services to, or AWS Batch and ECS to execute pipeline jobs. All these AWS resources are defined as code through AWS CDK, which can be found in the aws_infra subdirectory of the respective component's directory.
AWS CDK is an open-source framework that enables writing the entire cloud application as code, including all event sources and other AWS resources which are required to make the application executable in AWS, in addition to the application code. This allows for an easy and reproducible deployment that can be fully defined, versioned and documented as code.
To allow better interoperability and code sharing, all AWS CDK code in PAVI is written in Python, independent of the language used for the component it serves.
All PAVI AWS CDK code depends on the pavi_shared_aws python package (found in the /shared_aws/py_package/ directory), which holds all AWS CDK code and classes shared across multiple components.
Before running or making changes to any of the CDK code for PAVI submodules in the aws_infra directories, build and install the pavi_shared_aws package by following the build-and-install instructions in the corresponding README.
While shared AWS CDK code is stored in the pavi_shared_aws package, shared AWS resources which are managed by PAVI but used by multiple components (such as the AWS Chatbot configuration) are defined in the /shared_aws/aws_infra/ directory, which holds the AWS CDK definitions for those AWS resources.
While the CDK code is written in Python, the CDK CLI, which is used for validation and deployment of the defined AWS resources, is installed through npm, and has its version defined and frozen in the package.json and package-lock.json files respectively.
To install the CDK CLI used for any component, execute:
make install-cdk-cli
This will install the CDK CLI in a local node dependencies directory, which means the CLI is not installed globally but can instead be executed through npx cdk.
Before calling the CDK CLI on any of the CDK python code, ensure the relevant virtual environment,
in which the CDK app dependencies are installed, is activated.
All CDK CLI installations in this repository are installed and tested using the same node.js version used for the web UI, v20 at time of writing.
Here's a list of the most useful CDK CLI commands. For a full list, call cdk help.
- npx cdk ls: list all stacks in the app
- npx cdk synth: emits the synthesized CloudFormation template
- npx cdk deploy: deploy this stack to AWS
- npx cdk diff: compare deployed stack with current state
- npx cdk docs: open CDK documentation
Two standard CDK configuration files can be found at the root level of each aws_infra directory:
- cdk.json: contains the main CDK execution configuration parameters
- cdk.context.json: contains the VPC context in which to deploy the CDK Stack
The AWS Stack to be deployed using CDK is then generally defined in the following files and directories:
- cdk_app.py: the root-level CDK application, defining the entire AWS Stack to be deployed (possibly in multiple copies for multiple environments)
- cdk_classes/: Python sub-classes defining the parts of the CDK stack (which represents a single CloudFormation stack) and all individual CDK constructs, representing individual cloud components
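As a rough sketch of this layout (stack and module names below are hypothetical, for illustration only), a cdk_app.py could look like:

```python
import aws_cdk as cdk

from cdk_classes.component_stack import ComponentStack  # hypothetical cdk_classes module

app = cdk.App()

# One stack instance per environment, e.g. the main deployment and a dev-environment:
ComponentStack(app, "PaviComponentStack-main")
ComponentStack(app, "PaviComponentStack-dev")

app.synth()
```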
All CDK-defined AWS resource definitions are validated on every PR, and automatically deployed on merge to main.
When making changes to any of the CDK files, validate them before requesting a PR or attempting a deployment.
Note: as part of the validation requires comparison to deployed resources, you need to be able to authenticate to AWS before you can run the below validation target.
To validate the CDK code run the following command:
make validate-all
This make target will run two things:
- First it runs the unit tests (through the Makefile's run-unit-tests target), which test the CDK code for resource definitions expected by other parts of this repository, to ensure updates to the CDK code don't accidentally remove or rename essential AWS resources (see the example test sketch below).
- After the unit tests pass, it runs cdk diff on the production stack, which compares the production stack defined in the code to the deployed stack and displays the changes that would get deployed. Inspect these changes to ensure the code changes made will have the expected effect on the deployed AWS resources. As cdk diff synthesizes the full (CloudFormation) stack to do so, it will produce errors when errors are present in any of the CDK code (where those errors would not have been caught by the unit tests).
If the validation reports any errors, inspect them and correct the code accordingly.
This validation step allows the developer to fix any errors before deployment, reducing the amount of troubleshooting and fixing that would otherwise be required on failing or incorrect deployments.
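For illustration, such a resource-definition unit test could be sketched with the aws_cdk.assertions module as follows (stack and resource names are hypothetical):

```python
import aws_cdk as cdk
from aws_cdk.assertions import Template

from cdk_classes.component_stack import ComponentStack  # hypothetical, as in the sketch above


def test_ecr_repository_defined() -> None:
    app = cdk.App()
    stack = ComponentStack(app, "TestStack")
    template = Template.from_stack(stack)
    # Fails when the ECR repository that other parts of the repository depend on
    # is accidentally removed or renamed.
    template.has_resource_properties("AWS::ECR::Repository", {
        "RepositoryName": "pavi/pipeline-component"  # hypothetical repository name
    })
```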
Note: while some of the existing CDK code references external resources in AWS (outside of the CDK stack defined in the subdirectory in question), unit testing does not actually query those resources. As a result, unit testing will not catch changes to or errors in those (external) resource definitions. Only cdk diff will query actual AWS resources and produce errors accordingly if there are any issues with such externally defined resources. Consequently, the cdk diff step in the validate-all make target requires AWS authentication.
To first test the new/updated stack in AWS before updating the main deployment, some PAVI components support dev-environment deployments. See the README of the component in question for instructions on how to deploy specific components.
To deploy a complete dev-environment to AWS for testing, execute the following command at root-level in this repository:
> make deploy-dev
Once all validation and testing return the expected results, create a PR to merge into main. Once approved and merged, all code pushed to the main branch of this repository (both the application code and the AWS resources defined through CDK code) automatically gets built and deployed through GitHub Actions.
Like most modern software, PAVI relies heavily on third-party tools and libraries for much of its core functionality. We specifically acknowledge the creators and developers of the following third-party tools and libraries:
- BioPython: Cock PJ, Antao T, Chang JT, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422-1423. doi:10.1093/bioinformatics/btp163
- Nextflow: Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35(4):316-319. doi:10.1038/nbt.3820
- PySam: https://github.com/pysam-developers/pysam
- Samtools: Danecek P, Bonfield JK, Liddle J, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2):giab008. doi:10.1093/gigascience/giab008
Current maintainer: Manuel Luypaert