Get the code:
git clone https://github.com/rareylab/NetworkBalanceScaling.git
Run NBS to predict properties based on a training- and a query-dataset:
cd NetworkBalanceScaling
./workflow_predict.sh
Find NBS predictions in bace_NBS_predicted/bace_split.sdf:
...
> <pic50_predicted> (1)
7.34712390561689
...
Run NBS to optimize predictions generated by an exernal preditive model:
cd NetworkBalanceScaling
./workflow_optimize_predictions.sh
Find NBS-optimized predictions in bace_NBS_optimized_prediction/bace_result.sdf:
...
> <pic50> (1)
7.3010302
> <pic50_predicted> (1)
7.15287888112478
> <deviation> (1)
0.01553303034815148
> <pic50_predicted_optimized> (1)
7.168411911472931
...
Since the bace dataset contains measured values for the test-set, we can calculate how good the optimization performed:
StdDevs:
original all values: 1.3555922472779236
original only in net values: 1.368353774909065
original only NOT in net values: 1.2911598752840818
optimized all values: 1.1494894733393708
optimized only in net values: 1.1902379396695375
Root Mean Square Deviation R² Num_of_Samples
Predictions(all): 0.768106 0.676815 152
Predictions(only in net): 0.635844 0.781936 102
Predictions(only NOT in net): 0.984272 0.407014 50
Optimized (all): 0.762088 0.681860 152
Optimized (only): 0.624960 0.789338 102
Scores (un_opt): 0.984272 0.407014 50
Unoptimizable molecules (IDs) (50 mols):
BACE_1376
BACE_1371
...
all: scores for all test-compounds
only in net: only scores for compounds, that have a MMP connection to other compounds
only NOT in net: only compounds, that are not connected to other compounds and therefore, could not be optimized
Unoptimizable molecules: since some molecules have no MMP connection to other compounds, they couldn't be optimized, you find the list of IDs here
workflow_predict.sh: ~1min30sec
workflow_optimize_predictions.sh: ~2min
Please execute the following workflows depending on your preferences:
- installs the required conda environment if necessary
- generates a network
- optimizes the network and predicts missing molecular properties
- the output contains the predictions and the graph in separate files
- depending on hardware and data set size, this might take several minutes up to hours
- installs the required conda environment if necessary
- predicts missing molecular properties using a specified machine learning script
- generates a network containing the formerly made predictions
- optimizes the network, including the annotated predictions
- the output contains the predictions with optimization suggestions and the graph in separate files
- depending on hardware and data set size, this might take several minutes up to hours
- please note, that in the example workflow an existing model is used for a demo prediction
- building a new predictor with the provided scripts may take several hours
- execute the shell scripts provided in this directory
- for the usage of Network Balance Scaling as a predictor, please try and modify workflow_predict.sh
- for the usage of Network Balance Scaling to optimize existing predictions, please try and modify workflow_optimize_predictions.sh
- this workflow contains a precalculated model
- new models can be created with our inhouse predictor based on deepchem using the script predictors/bace_predictor_train.sh
- to generate predictions with an existing model independent from the workflow, please try the script predictors/bace_predictor_predict.sh
- please note, that our workflows only include the BACE data set of moleculenet due to a vast RAM requirement for larger data sets of this benchmark
- if you are interested in the other data sets of this benchmark, feel free to construct your pipeline and try our helper scripts in the utils directory
- you can find precalculated benchmark scenarios and the original CSV files from the moleculenet benchmark in the data directory
- note that all scripts should be startet from main directory in order of the path management
- to evaluate the results you achieved, please check out our script utils/evaluate_RMSD.py
- python >= 3.6 is needed to run the scripts
- conda is required (anaconda or miniconda); conda-environments are built automaticaly
- if you use another python environment, please make sure to provide the below-listed dependencies
These listed versions are tested, other may also work.
- python-igraph 0.8.3 or 0.10.1
- cvxpy 1.1.10 or 1.2.1
- rdkit 2020.09.04 or 2022.03.05
No non-standard hardware is required. However, problems occurred during python package installation on Windows 10 and 11 systems. Due to the vast RAM requirement, we recommend the use of a device with at least 16GB RAM and more for the use of larger data sets. This software was successfully tested on:
- openSUSE Leap 15.1
- SLES 12.5 (SUSE Linux Enterprise Server)
- Ubuntu 20.04.5
- macOS Big Sur Version 11.4
- macOS Monterey Version 12.6 arm64 (MacBook Pro, M1, 2020)
A script for creating networks with or without existing predictions at red vertices.
A script for optimizing networks. It can be used to either predict unknown properties or to optimize existing predictions depending on the option -p
The integrated predictor is based on DeepChem (https://github.com/deepchem/). Thus, the corresponding python scripts predictor/build_model.py and predictor/predictor.py additionally underly the MIT License. Our integrated predictor comprises:
- a trained graph convolution regression model for the BACE data set
- a trained message passing neural network model for the BACE data set
- the Python script predictor/build_model.py, which can be used to create a model
- the Python script predictor/predictor.py, which can be used to create predictions using an existing model
- the bash script predictor/bace_predictor_train.sh, which trains a model based on the BACE data set
- the bash script predictor/bace_predictor_predict.sh, which predicts properties for the BACE test set
- training, validation and test set for BACE can be found in our data
- for the other data sets, please use the scripts provided in utils to create your own data splits and convert files with respective formats
- deepchem 2.3 or 2.5
- rdkit 2020.09.04 or 2022.03.05
- tensorflow 1.14 or 2.10 (should fit to your deepchem version)
To incorporate existing predictions into your network, please mind the following:
- the resulting predictions need to be written to a SMILES file
- the SMILES file needs to follow the format SMILES\tID\tPROPERTY
- using workflow_optimize_predictions.sh you need to replace the call of our default predictor based on deepchem (./predictor/bace_predictor_predict.sh)
- please remember to adapt the training, validation and query set at the beginning of the bash script
- note: Please remember that all molecules need to be included into the mmpdb index. Using the network generation with existing predictions independently from workflow_optimize_predictions requires the molecules with known properties (training and validation sets) in a first SMILES file, your predictions in a second SMILES file and the index including all (training, validation and query) molecules.
- the workflows are constructed to run with SD files as input
- the SD files are converted during the workflow using the scripts provided at utils/
- for directly using SMILES files, please note that the required format consists of three tab-separated columns:
- SMILES\tID\tPROPERTY
- the SMILES format is also a requirement for the input files used in mmpdb's MMP analysis