Skip to content

Latest commit

 

History

History
158 lines (124 loc) · 4.47 KB

clogp.md

File metadata and controls

158 lines (124 loc) · 4.47 KB

Generating High cLogP Molecules with SyntheMol

Instructions for performing an in silico study of SyntheMol using a computational molecular property, cLogP, which is the computed octanol-water partition coefficient. Assumes relevant data has already been downloaded (see docs/README.md).

Compute cLogP

Compute cLogP on the training set.

chemfunc compute_properties \
    --data_path data/Data/1_training_data/antibiotics.csv \
    --properties clogp

For comparison purposes, compute cLogP on a random sample of REAL molecules.

chemfunc compute_properties \
    --data_path data/Data/4_real_space/random_real.csv \
    --properties clogp

Binarize clogP

Binarize cLogP by using 6.5 as a threshold.

chemfunc regression_to_classification \
    --data_path data/Data/1_training_data/antibiotics.csv \
    --regression_column clogp \
    --threshold 6.5 \
    --classification_column clogp_6.5

Antibiotics: 495 cLogP positive (3.7%) out of 13,524.

chemfunc regression_to_classification \
    --data_path data/Data/4_real_space/random_real.csv \
    --regression_column clogp \
    --threshold 6.5 \
    --classification_column clogp_6.5

Random REAL: 11 cLogP positive (0.044%) out of 25,000.

Train model

Train a Chemprop model on binary cLogP data using 30 epochs (strong model) or 1 epoch (weak model).

for EPOCHS in 1 30
do
python scripts/models/train.py \
    --data_path data/Data/1_training_data/antibiotics.csv \
    --save_dir data/Models/clogp_chemprop_${EPOCHS}_epochs \
    --dataset_type classification \
    --model_type chemprop \
    --property_column clogp_6.5 \
    --num_models 10 \
    --epochs ${EPOCHS}
done

1 epoch: ROC-AUC = 0.859 +/- 0.017, PRC-AUC = 0.198 +/- 0.045

30 epochs: ROC-AUC = 0.973 +/- 0.007, PRC-AUC = 0.736 +/- 0.049

Compute model scores for building blocks

Compute model scores for building blocks.

for EPOCHS in 1 30
do
python scripts/models/predict.py \
    --data_path data/Data/4_real_space/building_blocks.csv \
    --save_path data/Models/clogp_chemprop_${EPOCHS}_epochs/building_blocks.csv \
    --model_path data/Models/clogp_chemprop_${EPOCHS}_epochs \
    --model_type chemprop \
    --average_preds
done

Generate molecules

Apply SyntheMol to generate molecules.

for EPOCHS in 1 30
do
synthemol \
    --model_path data/Models/clogp_chemprop_${EPOCHS}_epochs \
    --model_type chemprop \
    --building_blocks_path data/Models/clogp_chemprop_${EPOCHS}_epochs/building_blocks.csv \
    --building_blocks_score_column chemprop_ensemble_preds \
    --building_blocks_id_column Reagent_ID \
    --reaction_to_building_blocks_path data/Data/4_real_space/reaction_to_building_blocks.pkl \
    --save_dir data/Data/5_generations_clogp/clogp_chemprop_${EPOCHS}_epochs \
    --max_reactions 1 \
    --n_rollout 20000 \
    --replicate
done

1 epoch: 27,123 molecules in 7 hours and 19 minutes.

30 epochs: 25,550 generated molecules in 8 hours and 53 minutes.

Compute true cLogP

Compute the true cLogP for generated molecules.

for EPOCHS in 1 30
do
chemfunc compute_properties \
    --data_path data/Data/5_generations_clogp/clogp_chemprop_${EPOCHS}_epochs/molecules.csv \
    --properties clogp
done

Plot distribution of train vs generated vs REAL cLogP.

chemfunc plot_property_distribution \
    --data_paths data/Data/1_training_data/antibiotics.csv \
    data/Data/4_real_space/random_real.csv \
    data/Data/5_generations_clogp/clogp_chemprop_30_epochs/molecules.csv \
    data/Data/5_generations_clogp/clogp_chemprop_1_epochs/molecules.csv \
    --property_column clogp \
    --save_path plots/clogp_generations.pdf \
    --min_value -7 \
    --max_value 12

Binarize the true cLogP for generated molecules using the 6.5 threshold.

for EPOCHS in 1 30
do
chemfunc regression_to_classification \
    --data_path data/Data/5_generations_clogp/clogp_chemprop_${EPOCHS}_epochs/molecules.csv \
    --regression_column clogp \
    --threshold 6.5 \
    --classification_column clogp_6.5
done

1 epoch: 3,195 positive (11.78%) out of 27,123 (vs 0.044% for random REAL).

30 epochs: 15,693 positive (61.42%) out of 25,550 (vs 0.044% for random REAL).