Merge pull request #121 from samplchallenges/add_physprop_submissions
Add submissions of physical property predictions
bergazin committed Oct 11, 2020
2 parents 6c72826 + fb86aa9 commit f47257f
Showing 58 changed files with 7,576 additions and 3 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -59,6 +59,7 @@ The SAMPL7 physical property challenge is now open! All three host-guest challen
- **Added additional microstates for pKa challenge**, from Bogdan Iorga (Sept. 30, 2020). Updated instructions to clarify that any states not included in pKa predictions will be assumed to be unpopulated (so participants can omit these states). Updated pKa instructions/template to allow optional submission of macro pKa values.
- **Note that experiments used specified chirality for certain physical property compounds**, `SM35`, `SM36` and `SM37`. So only the structures with specified chirality for these compounds should be used.
- **Add SAMPL7 physical properties experimental values** (Oct. 10, 2020).
- Add SAMPL7 physical properties submissions (Oct. 10, 2020)

## Challenge overview

6 changes: 3 additions & 3 deletions physical_property/README.md
@@ -40,9 +40,9 @@ Effective permeability (log<sub>*P*<sub>*app*</sub></sub>) was measured by PAMPA
## What's here

- [`SAMPL7_molecule_ID_and_SMILES.csv`](SAMPL7_molecule_ID_and_SMILES.csv): A `.CSV` file containing SAMPL7 challenge molecule IDs and SMILES. SMILES were provided by the [Ballatore lab](https://pharmacy.ucsd.edu/faculty/ballatore).
- [`logP/`](logP/): Folder contains an input file in `.CSV` format with SMILES strings of the neutral states of the molecules. This folder contains instructions and a submission template for the logP challenge.
- [`pKa/`](pKa/): Folder contains challenge input files in `.CSV` format with SMILES of enumerated microstates. `.MOL2` and `.SDF` files of each microstate are also provided. This folder contains instructions and a submission template for the pKa challenge. Microstates (tautomers and protomers) were generated with a notebook which uses RDKit and OpenEye tools. Additional microstates were enumerated using Chemicalize (Chemaxon) and Epik (Schrodinger) and added to the notebook generated `.CSV` files.
- [`permeability/`](permeability/): Folder contains input files in `.CSV` format with SMILES strings of molecules. This folder contains instructions and a submission template for the permeability challenge.
- [`logP/`](logP/): Folder contains an input file in `.CSV` format with SMILES strings of the neutral states of the molecules. This folder contains instructions and a submission template for the logP challenge. Also contains submission files for submitted predictions.
- [`pKa/`](pKa/): Folder contains challenge input files in `.CSV` format with SMILES of enumerated microstates. `.MOL2` and `.SDF` files of each microstate are also provided. This folder contains instructions and a submission template for the pKa challenge. Microstates (tautomers and protomers) were generated with a notebook which uses RDKit and OpenEye tools. Additional microstates were enumerated using Chemicalize (Chemaxon) and Epik (Schrodinger) and added to the notebook generated `.CSV` files. Also contains submission files for submitted predictions.
- [`permeability/`](permeability/): Folder contains input files in `.CSV` format with SMILES strings of molecules. This folder contains instructions and a submission template for the permeability challenge. Also contains submission files for submitted predictions.
- [`images/`](images): Folder containing images related to this challenge in PDF and/or JPEG format.
- [`experimental_data/`](experimental_data/): Folder will contain experimental measurements of pK<sub>a</sub>, partitioning, and permeability values after the SAMPL7 challenge submission deadline.

35 changes: 35 additions & 0 deletions physical_property/logP/Analysis/SAMPL7-user-map-HG.csv
@@ -0,0 +1,35 @@
21,logP-dddc-1.csv
23,logP-dddc-2.csv
25,logP-JudithWarnauDassaultSystemes.csv
26,logP-ChrisLoschen-1.csv
27,logP-ChrisLoschen-2.csv
28,logP-FabioFalcioni-1.csv
29,logp_ensemble_logp_model1.csv
30,logp_ensemble_logp_model2.csv
31,logP-EvrimArslan-6.csv
32,logP-ChrisLoschen-1_5l9kQR3.csv
33,logp_DB1.csv
34,logp_DB2.csv
35,logp_DB3.csv
36,logp_DB4.csv
37,logP-IEFPCMMST-1.csv
38,logP-ECRISM-1.csv
39,logP_AndrewPaluch_MD_1.csv
40,logP_AndrewPaluch_MD_2.csv
41,logP-PieroProcacci-NES1-B.csv
42,logP-PieroProcacci-NES1-G.csv
43,logP-PieroProcacci-NES1-J.csv
44,logP-DavyGuan-1.csv
45,LogP_chemprop_submission.csv
46,logP_RodriguezPaluch_SM12_1.csv
47,logP_RodriguezPaluch_SM8_1.csv
48,logP_RodriguezPaluch_SM8_2.csv
49,logP_RodriguezPaluch_SMD_1.csv
50,logP_RodriguezPaluch_SMD_2.csv
51,logP-MLRUCR-1.csv
52,logp-nhlbi-1.csv
53,logp-nhlbi-2.csv
54,logP_prediction_Iorga_Beckstein_LigParGen.csv
55,logP_prediction_Iorga_Beckstein_CGenFF.csv
56,logP_prediction_Iorga_Beckstein_GAFF.csv
57,logP_prediction_Iorga_Beckstein_OPLS-AA.csv
16 changes: 16 additions & 0 deletions physical_property/logP/Analysis/Scripts/get_usermap.py
@@ -0,0 +1,16 @@
#!/usr/bin/env python

outfile = '../SAMPL7-user-map-HG.csv'

# Read user map from submission server
file = open('/Users/dmobley/github/SAMPL-submission-systems/SAMPL-submission-handling-shared/submissions/downloads/submission_table.txt', 'r')
text = file.readlines()
file.close()

# Write output file, removing e-mail addresses
file = open(outfile, 'w')
for line in text:
    tmp = line.split(',')
    if 'LOGP' in tmp[2].upper():
        file.write(f'{tmp[0].strip()},{tmp[2].strip().replace(" ","_")}\n')
file.close()
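The filtering step in `get_usermap.py` can be tried without access to the submission server. The following is a minimal sketch, assuming each row of `submission_table.txt` has the form `ID, e-mail, filename`. That layout is inferred from the indices used in the script, and the sample rows, e-mail addresses and the `build_user_map` helper are hypothetical. The length guard for malformed rows is an addition that the committed script does not have.

```python
def build_user_map(lines, keyword="LOGP"):
    """Return 'id,filename' rows for submissions whose filename contains keyword.

    The e-mail column (index 1) is dropped, mirroring the committed script.
    """
    rows = []
    for line in lines:
        fields = line.split(",")
        if len(fields) < 3:
            continue  # skip blank or malformed rows (not done in the original)
        if keyword in fields[2].upper():
            submission_id = fields[0].strip()
            filename = fields[2].strip().replace(" ", "_")
            rows.append(f"{submission_id},{filename}")
    return rows


# Hypothetical input rows: ID, e-mail, filename
sample = [
    "21, user1@example.org, logP-dddc-1.csv\n",
    "22, user2@example.org, pKa-example-1.csv\n",
    "45, user3@example.org, LogP chemprop submission.csv\n",
]

user_map = build_user_map(sample)
print("\n".join(user_map))
```

Only the two logP rows survive, and the space-separated filename is written with underscores, which matches the style of the entries in `SAMPL7-user-map-HG.csv`.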
@@ -0,0 +1,165 @@
# OCTANOL TO WATER (ΔG_octanol - ΔG_water) TRANSFER FREE ENERGY PREDICTIONS
#
# This file will be automatically parsed. It must contain the following four elements:
# predictions, name of method, software listing, and method description.
# These elements must be provided in the order shown with their respective headers.
#
# Any line that begins with a # is considered a comment and will be ignored when parsing.
#
#
# PREDICTION SECTION
#
# It is mandatory to submit water to octanol (ΔG_octanol - ΔG_water) transfer free energy (TFE) predictions for all 22 molecules.
# Incomplete submissions will not be accepted.
# The energy units must be in kcal/mol.

# Please report the general molecule `ID tag` in the form of `SMXX` (e.g. SM25, SM26, etc).
# Please indicate the microstate(s) used in the `Molecule ID/IDs considered (no commas)` section (e.g. `SM25_micro000`, `SM25_extra001`)
# Please report TFE standard error of the mean (SEM) and TFE model uncertainty.
#
# The data in each prediction line should be structured as follows:
# ID tag, Molecule ID/IDs considered (no commas), TFE, TFE SEM, TFE model uncertainty
#
# Your transfer free energy prediction for the neutral form does NOT have to be `SMXX_micro000` (which is the challenge provided neutral microstate).
# If you use a microstate other than the challenge provided microstate, please fill out the `Molecule ID/IDs considered (no commas)` section using a molecule ID in the form of `SMXX_extra001` (number can vary) and please list the molecule ID and its SMILES string in your methods description in the `METHOD DESCRIPTION SECTION`.
#
# Only one entry in the second column (`Molecule ID/IDs considered (no commas)`) is required, but you should list all IDs considered/input to your calculations. See challenge instructions.
#
# If you have evaluated additional microstates then the molecule ID used in the `Molecule ID/IDs considered (no commas)` section needs to be in the format: `SMXX_extra001` (number can vary).
# If multiple microstates are used, please report the order of population in the aqueous phase in descending order.
# Please list microstate populations, SMILES strings and the molecule IDs in the `METHOD DESCRIPTION SECTION` section further below.
#
# The list of predictions must begin with the 'Predictions:' keyword as illustrated here.
Predictions:
SM33,SM33_micro000,-4.77,0.09,0.53
SM42,SM42_micro000,-4.31,0.05,0.53
SM30,SM30_micro000,-4.02,0.10,0.53
SM34,SM34_micro000,-3.59,0.15,0.53
SM43,SM43_micro000,-3.42,0.10,0.53
SM25,SM25_micro000,-3.18,0.04,0.53
SM45,SM45_micro000,-3.01,0.07,0.53
SM31,SM31_micro000,-2.94,0.09,0.53
SM39,SM39_micro000,-2.89,0.08,0.53
SM36,SM36_micro000,-2.79,0.10,0.53
SM32,SM32_micro000,-2.61,0.12,0.53
SM27,SM27_micro000,-2.53,0.11,0.53
SM41,SM41_micro000,-2.41,0.07,0.53
SM46,SM46_micro000,-2.08,0.10,0.53
SM29,SM29_micro000,-1.94,0.08,0.53
SM40,SM40_micro000,-1.87,0.09,0.53
SM37,SM37_micro000,-1.85,0.11,0.53
SM26,SM26_micro000,-1.51,0.06,0.53
SM28,SM28_micro000,-1.40,0.06,0.53
SM44,SM44_micro000,-1.07,0.08,0.53
SM38,SM38_micro000,-0.96,0.08,0.53
SM35,SM35_micro000,-0.93,0.09,0.53

#
#
# Please list your name, using only UTF-8 characters as described above. The "Participant name:" entry is required.
Participant name:
Bart Lenselink

#
#
# Please list your organization/affiliation, using only UTF-8 characters as described above.
Participant organization:
Galapagos

#
#
# NAME SECTION
#
# Please provide an informal but informative name of the method used.
# The name must not exceed 40 characters.
# The 'Name:' keyword is required as shown here.
Name:
Chemprop

#
#
# COMPUTE TIME SECTION
#
# Please provide the average compute time across all of the molecules.
# For physical methods, report the GPU and/or CPU compute time in hours.
# For empirical methods, report the query time in hours.
# Create a new line for each processor type.
# The 'Compute time:' keyword is required as shown here.
Compute time:
0.05, GPU


#
# COMPUTING AND HARDWARE SECTION
#
# Please provide details of the computing resources that were used to train models and make predictions.
# Please specify compute time for training models and querying separately for empirical prediction methods.
# Provide a detailed description of the hardware used to run the simulations.
# The 'Computing and hardware:' keyword is required as shown here.
Computing and hardware:
Linux workstation with an Intel(R) Xeon(R) W-2123 CPU and a Quadro RTX 6000 GPU. Training the models took around a day, including a parameter search (250 iterations).


# SOFTWARE SECTION
#
# List all major software packages used and their versions.
# Create a new line for each software.
# The 'Software:' keyword is required.
Software:
Chemprop (https://github.com/chemprop/chemprop, cloned in May 2020)
Pipeline Pilot 17.2.0.1361
ADMET Predictor 9.5

# METHOD CATEGORY SECTION
#
# State which method category your prediction method is better described as:
# `Physical (MM)`, `Physical (QM)`, `Empirical`, or `Mixed`.
# Pick only one category label.
# The `Category:` keyword is required.
Category:
Empirical


# METHOD DESCRIPTION SECTION
#
# Methodology and computational details.
# Level of details should be roughly equivalent to that used in a publication.
# Please include the values of key parameters with units.
# Please explain how statistical uncertainties were estimated.
#
# If you have evaluated additional microstates, please report their SMILES strings and populations of all the microstates in this section.
# If you used a microstate other than the challenge provided microstate (`SMXX_micro000`), please list your chosen `Molecule ID` (in the form of `SMXX_extra001`) along with the SMILES string in your methods description.
#
# Use as many lines of text as you need.
# All text following the 'Method:' keyword will be regarded as part of your free text methods description.
Method:
As a basis we used the logp dataset of the OPERA models (https://github.com/kmansouri/OPERA), accessed September 2020.
This dataset was processed and standardized in Pipeline Pilot. We created a test set tailored to the challenge in order to compare different models (SAMPL_logp_1.xml):
All molecules from the OPERA set with an ECFP_6 Tanimoto coefficient (TC) > 0.25 to the challenge molecules were flagged as test (233 molecules).
The training set was created from the rest of the set by subsequently filtering out all molecules with an ECFP_6 TC > 0.4 to molecules found in the test set.
Several models were built using D-MPNN (https://github.com/chemprop/chemprop), focusing on (1) adding helper tasks and (2) changing the parameters of the model.

1: adding helper tasks:
We tested different datasets that could be complementary in nature, each as a separate task in the multitask (MT) neural network:
LogP data from ChEMBL_26, LogD data from ChEMBL (AZ, doc id: CHEMBL3301361), and in-house data.
Based on performance, both the ChEMBL_26 logP data and the AZ LogD data from ChEMBL were added (all public data).
Finally, we calculated logP and LogD for all molecules using ADMET Predictor (Simulations Plus); those predictions were added as additional tasks to the network (so 5 tasks in total).

2: changing the parameters of the model:
Different parameters were explored using the native hyperopt script (250 iterations), as well as different ensemble sizes.
The final model was trained on all data, using an ensemble size of 10.
Predictions were made on the basis of this ensemble.
TFE standard error of the mean (SEM) was estimated from the ensemble predictions. TFE model uncertainty was estimated from the RMSE on the test set (0.388*1.36333619568).
TFE was calculated from logP: TFE = logP * -1.36333619568
#-RT*ln(10)
#-1.36333619568 = -1*(1.985877534*0.001)*298.15 *ln(10)

#
#
# All submissions must either be ranked or non-ranked.
# Only one ranked submission per participant is allowed.
# Multiple ranked submissions from the same participant will not be judged.
# Non-ranked submissions are accepted so we can verify that they were made before the deadline.
# The "Ranked:" keyword is required, and expects a Boolean value (True/False)
Ranked:
True
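The logP to TFE conversion quoted in the method description above (TFE = -RT ln(10) * logP) can be checked numerically. This is a small sketch rather than the participant's code: the gas constant and temperature are the values quoted in the submission, while the function names `logp_to_tfe` and `tfe_to_logp` are hypothetical.

```python
import math

R_KCAL = 1.985877534e-3  # gas constant in kcal/(mol*K), as quoted in the submission
T_KELVIN = 298.15        # temperature in K, as quoted in the submission

# RT*ln(10): converts one log unit of partition coefficient into kcal/mol
FACTOR = R_KCAL * T_KELVIN * math.log(10)


def logp_to_tfe(logp):
    """Water-to-octanol transfer free energy (kcal/mol) from logP."""
    return -FACTOR * logp


def tfe_to_logp(tfe):
    """Inverse conversion: logP implied by a transfer free energy in kcal/mol."""
    return -tfe / FACTOR


# Model uncertainty: test-set RMSE in log units scaled to kcal/mol
model_uncertainty = 0.388 * FACTOR

print(f"RT ln(10)          = {FACTOR:.6f} kcal/mol")
print(f"model uncertainty  = {model_uncertainty:.2f} kcal/mol")
print(f"logP implied by TFE -4.77 = {tfe_to_logp(-4.77):.2f}")
```

The factor reproduces the 1.36333619568 used in the submission, and the scaled RMSE rounds to the 0.53 kcal/mol reported in the last column of every prediction line.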
118 changes: 118 additions & 0 deletions physical_property/logP/Analysis/Submissions/logP-ChrisLoschen-1.csv
@@ -0,0 +1,118 @@
# OCTANOL TO WATER (ΔG_octanol - ΔG_water) TRANSFER FREE ENERGY PREDICTIONS
#
Predictions:
SM25,SM25_micro000,-3.42,0.36,0.50
SM26,SM26_micro000,-1.35,0.00,0.50
SM27,SM27_micro000,-2.17,0.15,0.50
SM28,SM28_micro000,-1.14,0.36,0.50
SM29,SM29_micro000,-2.12,0.39,0.50
SM30,SM30_micro000,-3.65,0.43,0.50
SM31,SM31_micro000,-2.61,0.45,0.50
SM32,SM32_micro000,-3.13,0.13,0.50
SM33,SM33_micro000,-4.97,0.27,0.50
SM34,SM34_micro000,-3.74,0.28,0.50
SM35,SM35_micro000,-1.99,0.34,0.50
SM36,SM36_micro000,-3.54,0.36,0.50
SM37,SM37_micro000,-2.53,0.52,0.50
SM38,SM38_micro000,-1.02,0.20,0.50
SM39,SM39_micro000,-2.76,0.20,0.50
SM40,SM40_micro000,-1.49,0.21,0.50
SM41,SM41_micro000,-2.59,0.00,0.50
SM42,SM42_micro000,-4.91,0.00,0.50
SM43,SM43_micro000,-3.29,0.08,0.50
SM44,SM44_micro000,-1.16,0.00,0.50
SM45,SM45_micro000,-3.30,0.10,0.50
SM46,SM46_micro000,-1.75,0.16,0.50

#
#
# Please list your name, using only UTF-8 characters as described above. The "Participant name:" entry is required.
Participant name:
Chris Loschen

#
#
# Please list your organization/affiliation, using only UTF-8 characters as described above.
Participant organization:
not-organized/private

#
#
# NAME SECTION
#
# The 'Name:' keyword is required as shown here.
Name:
ffsampled_deeplearning_cl1

#
#
# COMPUTE TIME SECTION
#
# Please provide the average compute time across all of the molecules.
# For physical methods, report the GPU and/or CPU compute time in hours.
# For empirical methods, report the query time in hours.
# Create a new line for each processor type.
# The 'Compute time:' keyword is required as shown here.
Compute time:
0.01 hours, GPU

#
# COMPUTING AND HARDWARE SECTION
#
# Please provide details of the computing resources that were used to train models and make predictions.
# Please specify compute time for training models and querying separately for empirical prediction methods.
# Provide a detailed description of the hardware used to run the simulations.
# The 'Computing and hardware:' keyword is required as shown here.
Computing and hardware:
All the simulations were performed on one GeForce GTX 1080 on a single linux machine.
Training for 100 epochs took about 1 hour.

# SOFTWARE SECTION
#
# List all major software packages used and their versions.
# Create a new line for each software.
# The 'Software:' keyword is required.
Software:
Schnetpack 0.3
Fastai 1.0.6
RDKit 2020.03.3

# METHOD CATEGORY SECTION
#
# State which method category your prediction method is better described as:
# `Physical (MM)`, `Physical (QM)`, `Empirical`, or `Mixed`.
# Pick only one category label.
# The `Category:` keyword is required.
Category:
Empirical

# METHOD DESCRIPTION SECTION
#
# Methodology and computational details.
# Level of details should be roughly equivalent to that used in a publication.
# Please include the values of key parameters with units.
# Please explain how statistical uncertainties were estimated.
#
# If you have evaluated additional microstates, please report their SMILES strings and populations of all the microstates in this section.
# If you used a microstate other than the challenge provided microstate (`SMXX_micro000`), please list your chosen `Molecule ID` (in the form of `SMXX_extra001`) along with the SMILES string in your methods description.
#
# Use as many lines of text as you need.
# All text following the 'Method:' keyword will be regarded as part of your free text methods description.
Method:
A modified version of the deeplearning package schnetpack was used which is based on the work of Schütt et al.[1] and may be seen as a variant of message passing neural networks. However, the input does not use the chemical graph but is only 3D structure based and does not rely on any kind of precomputed descriptors, rather the molecular representations are learned on-the-fly during training. The neural net was trained with the fastai library, version 1 [2] using accelerated learning, so-called super-convergence as published by L. Smith et al.[3] and other tools available from the fastai library, which allow for fast iterations during testing.
A curated logP dataset was assembled mainly from the work of Mansouri et al.[4] and used for training, testing and validation. Input structures for the neural net were generated from the provided SMILES via the distance geometry approach as implemented in RDKit, and a quick conformational sampling was carried out using the MMFF94 force field.
Before the 3D structure generation, molecules were brought into a canonical representation with RDKit. Statistical uncertainties were estimated based on the average of 10 distinct prediction runs and on the overall test-set performance.

[1] Schütt, K. T., Kessel, P., Gastegger, M., Nicoli, K. A., Tkatchenko, A., & Müller, K. R. (2018). SchNetPack: A deep learning toolbox for atomistic systems. Journal of chemical theory and computation, 15(1), 448-455.
[2] https://fastai1.fast.ai/
[3] Smith, L. N., & Topin, N. (2019, May). Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications (Vol. 11006, p. 1100612). International Society for Optics and Photonics.
[4] Mansouri, K., Grulke, C. M., Judson, R. S., & Williams, A. J. (2018). OPERA models for predicting physicochemical properties and environmental fate endpoints. Journal of cheminformatics, 10(1), 10.

#
# All submissions must either be ranked or non-ranked.
# Only one ranked submission per participant is allowed.
# Multiple ranked submissions from the same participant will not be judged.
# Non-ranked submissions are accepted so we can verify that they were made before the deadline.
# The "Ranked:" keyword is required, and expects a Boolean value (True/False)
Ranked:
True
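The template states that submission files are parsed automatically, that `#` lines are comments, and that predictions for all 22 molecules are mandatory. A minimal sketch of how the `Predictions:` section could be read and checked for completeness is shown here. This is not the SAMPL analysis code: `parse_predictions` and `EXPECTED_IDS` are hypothetical names, and ending the section at the next line that ends with a colon is an assumption about the format.

```python
# SM25 .. SM46: the 22 molecule IDs that appear in the submissions above
EXPECTED_IDS = {f"SM{n}" for n in range(25, 47)}


def parse_predictions(text):
    """Return (ID tag, microstate IDs, TFE, SEM, model uncertainty) tuples."""
    rows = []
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        if line == "Predictions:":
            in_section = True
            continue
        if in_section:
            if line.endswith(":"):
                break  # assumed: the next keyword line ends the section
            mol_id, microstates, tfe, sem, unc = (f.strip() for f in line.split(","))
            rows.append((mol_id, microstates, float(tfe), float(sem), float(unc)))
    return rows


# Shortened excerpt of a submission file (first three prediction lines only)
sample = """\
# PREDICTION SECTION
Predictions:
SM25,SM25_micro000,-3.42,0.36,0.50
SM26,SM26_micro000,-1.35,0.00,0.50
SM27,SM27_micro000,-2.17,0.15,0.50

Participant name:
Chris Loschen
"""

rows = parse_predictions(sample)
missing = sorted(EXPECTED_IDS - {row[0] for row in rows})
print(len(rows), "rows parsed;", len(missing), "molecules missing")
```

On the shortened excerpt the check reports 19 missing molecules; on a complete submission file `missing` would be empty.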