The repo contains:
- ETL scripts used to populate the PanKB website DEV and PROD databases;
- scripts used to set up a dockerized, self-deployed standalone MongoDB instance for development purposes.
Liubov Pashkova, liupa@dtu.dk
Create the necessary directories if they do not exist:
sudo mkdir -p /projects
cd /projects
sudo mkdir -p pankb_web
sudo chown -R $USER pankb_web
cd pankb_web
Clone the PanKB git repo into the subdirectory /pankb_db and change to it:
git clone --branch main https://github.com/biosustain/pankb_db.git pankb_db
cd pankb_db
The ETL (Extract-Transform-Load) scripts:
- extract information about pangenomes from the Microsoft Azure Blob Storage *.json files. The storage serves as the data lake;
- transform it into the Django- and MongoDB-compatible model;
- load the transformed data into a MongoDB database instance;
- (optionally) upload the logs needed for statistics and quality control to the Azure Blob Storage after the pipeline scripts are executed.
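The extract, transform and load steps listed above can be sketched roughly as follows. This is only an illustration and not the code from the repo: the function names, the document fields (`species`, `gene`) and the shape of the input JSON are all hypothetical, and the blob download and the database write are replaced by a plain string and a callable so that the sketch runs without Azure or MongoDB.

```python
import json
from typing import Callable, Dict, Iterable, List


def extract(blob_text: str) -> Dict[str, dict]:
    """Parse the text of one already-downloaded *.json blob."""
    return json.loads(blob_text)


def transform(raw: Dict[str, dict], species: str) -> List[dict]:
    """Flatten {gene: {attributes}} into one flat document per gene.

    The field names are hypothetical; the real model is defined by the
    Django models of the PanKB website.
    """
    return [
        {"species": species, "gene": gene, **attrs}
        for gene, attrs in sorted(raw.items())
    ]


def load(docs: Iterable[dict], insert_many: Callable[[List[dict]], object]) -> int:
    """Pass the documents to a sink and return how many were written."""
    batch = list(docs)
    if batch:  # skip the call entirely when there is nothing to write
        insert_many(batch)
    return len(batch)


if __name__ == "__main__":
    store: List[dict] = []  # stands in for a MongoDB collection
    text = '{"geneB": {"freq": 2}, "geneA": {"freq": 5}}'
    written = load(transform(extract(text), "Species_x"), store.extend)
    print(written, store)
```

With PyMongo, the `insert_many` argument could be the `insert_many` method of a collection object instead of `store.extend`.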
Initially, the database tables are created by the Django web framework, on which the PanKB website is built. This is achieved by setting the parameter managed = True in the models.py files.
Tested on Linux Ubuntu 20.04 (may need tweaks for other systems) with the following configuration:
- Git
- Python 3.8.10
The required Python package versions are listed in the requirements.txt file and can be installed via:
pip install -r requirements.txt
or
pip3 install -r requirements.txt
Before executing any scripts, create the .env file under the subfolder /etl with the following content if you are populating a self-deployed MongoDB instance:
## Do not put this file under version control!
# The MongoDB database name
MONGODB_NAME = 'pankb'
# The MongoDB root username
MONGO_INITDB_ROOT_USERNAME = 'allDbAdmin'
# The MongoDB root password
MONGO_INITDB_ROOT_PASSWORD = '<any password you choose>'
## Azure Blob Storage Connection String
BLOB_STORAGE_CONN_STRING = '<copy the Azure Blob Storage connection string from the Azure web portal>'
or, if you are populating a cloud-based Azure CosmosDB for MongoDB instance:
## Do not put this file under version control!
# The MongoDB database name
MONGODB_NAME = 'pankb'
## MongoDB-PROD (Azure CosmosDB for MongoDB) Connection String
MONGODB_CONN_STRING = '<copy the Azure CosmosDB for MongoDB connection string from the Azure web portal>'
## Azure Blob Storage Connection String
BLOB_STORAGE_CONN_STRING = '<copy the Azure Blob Storage connection string from the Azure web portal>'
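To make the difference between the two .env variants concrete, here is a minimal standard-library sketch of how such a file could be read and turned into a MongoDB connection URI. It is not the repo's implementation: the function names, the `localhost` host and the use of the root user for the self-deployed case are assumptions made for illustration.

```python
from typing import Dict
from urllib.parse import quote_plus


def parse_env(text: str) -> Dict[str, str]:
    """Parse lines of the form KEY = 'value', ignoring comments and blanks."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY = value line
        values[key.strip()] = value.strip().strip("'\"")
    return values


def self_deployed_uri(env: Dict[str, str], host: str = "localhost", port: int = 27017) -> str:
    """Build a URI from the root credentials (host and port are assumptions)."""
    user = quote_plus(env["MONGO_INITDB_ROOT_USERNAME"])
    password = quote_plus(env["MONGO_INITDB_ROOT_PASSWORD"])
    return f"mongodb://{user}:{password}@{host}:{port}/"


def connection_uri(env: Dict[str, str]) -> str:
    """Prefer the cloud connection string when it is present."""
    return env.get("MONGODB_CONN_STRING") or self_deployed_uri(env)


if __name__ == "__main__":
    with open(".env", encoding="utf-8") as handle:
        settings = parse_env(handle.read())
    # Print only the database name; the URI itself contains the password.
    print(settings["MONGODB_NAME"])
```

Percent-encoding the username and password with `quote_plus` matters when the chosen password contains characters such as `@` or `:`.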
Then, edit the included etl/config.py file, setting the following parameters:
- the database type (self-deployed or cloud-based MongoDB instance);
- species for which pangenome data should be inserted or modified (all species or only chosen ones);
- a local folder on your machine where the ETL scripts' logs will be saved (the folder must be created beforehand);
- whether the logs produced by the individual ETL scripts should be uploaded to the Azure Blob storage or not.
Finally, the ETL scripts must be executed in the following order:
organisms.py
gene_annotations.py
gene_info.py
genome_info.py
pathway_info.py
python3 <insert the respective script name here>
The scripts were not joined into one pipeline, because in practice it is more convenient to run them one by one for the sake of:
- quality control after each step;
- monitoring that storage and RAM are not running out on the DEV server and that the CPUs on both the DEV and PROD servers are not overloaded (via the "Metrics" section on the Azure Portal or with the help of a remote IDE, e.g., PyCharm).
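The ordering constraint itself can be written down as a small helper that runs the scripts in sequence and stops at the first failure, so that the failing step can be inspected before anything else runs. This is a hypothetical sketch and not part of the repo; it does not replace the manual checks described above, and it assumes it is started from the folder that contains the ETL scripts.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# Execution order as given in this README.
ETL_SCRIPTS = [
    "organisms.py",
    "gene_annotations.py",
    "gene_info.py",
    "genome_info.py",
    "pathway_info.py",
]


def run_script(name: str) -> int:
    """Run one script with the current interpreter and return its exit code."""
    return subprocess.run([sys.executable, name]).returncode


def run_in_order(
    scripts: Sequence[str] = ETL_SCRIPTS,
    runner: Callable[[str], int] = run_script,
) -> List[str]:
    """Run scripts one after another; stop at the first non-zero exit code."""
    completed: List[str] = []
    for name in scripts:
        if runner(name) != 0:
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    done = run_in_order()
    print("Completed:", ", ".join(done) if done else "none")
```

The `runner` parameter exists so that the stop-on-failure behaviour can be checked with a stand-in function instead of the real scripts.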
A good practice is to clean up unnecessary Docker images and containers and then restart the Docker daemon with the following commands:
docker system prune
sudo systemctl restart docker
Tested on Linux Ubuntu 20.04 (may need tweaks for other systems) with the following configuration:
- Git
- Docker & Docker Compose
Create a directory to be mounted as a docker volume:
mkdir -p /projects/pankb_web/docker_volumes/mongodb
Create a file with the name ".env" under the /projects/pankb_web/pankb_db/mongodb folder in the following format (do not forget to choose your own MONGO_INITDB_ROOT_PASSWORD and MONGODB_PASSWORD and optionally other fields):
## Do not put this file under version control!
# The MongoDB root username
MONGO_INITDB_ROOT_USERNAME = 'allDbAdmin'
# The MongoDB root password
MONGO_INITDB_ROOT_PASSWORD = '<any password you choose>'
# The MongoDB database admin username
MONGODB_USERNAME = 'pankbDbOwner'
# The MongoDB database admin password
MONGODB_PASSWORD = '<any password you choose>'
MONGODB_AUTH_SOURCE = 'pankb'
Change to the appropriate folder and build the containers with Docker Compose:
cd /projects/pankb_web/pankb_db/mongodb
docker compose up -d --build
The MongoDB instance should now be ready to accept connections on port 27017 (the MongoDB default). The command docker ps should show the Docker container up and running, e.g.:
>>> docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
deac2ceebf43 mongo:6.0-rc "docker-entrypoint.s…" 2 weeks ago Up 41 hours 0.0.0.0:27017->27017/tcp, :::27017->27017/tcp pankb-mongodb
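Besides docker ps, the port can be probed directly. The following standard-library sketch only checks that something accepts TCP connections on the given port; it does not verify credentials or that the listener is really MongoDB. The function name and the `localhost` default are assumptions.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 27017, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    state = "reachable" if port_is_open() else "not reachable"
    print(f"MongoDB port 27017 is {state}")
```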