Merge pull request #129 from samplchallenges/logD_analysis
Add logD analysis
davidlmobley committed Apr 20, 2021
2 parents a6d71eb + 853d1bd commit 7b45442
Showing 289 changed files with 8,006 additions and 124 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -63,8 +63,9 @@ All SAMPL7 challenges are now closed. Note the first phase of the SAMPL8 host-gu
- **Release 0.6** (Oct. 13, 2020, DOI [10.5281/zenodo.4086044](https://dx.doi.org/10.5281/zenodo.3975152)): Release of the finalized physical properties challenge inputs, formats, submissions, and experimental results. These changes were all available in master earlier (see detailed changelog in release notes), but this provides an official release. Analysis of physical properties results will come in a later release.

### Changes not in a release
- Added physical properties analysis (December 2020-January 2021)
- Fixed two submissions that had errors and updated the overview plots/stats and individual plots for the two affected submissions (4/9/2021)
- Added physical properties analysis for logP and pKa (December 2020-January 2021)
- Fixed two log *P* submissions that had errors and updated the overview plots/stats and individual plots for the two affected submissions (4/9/2021)
- After fixing the two log *P* submissions that had errors, log *D* estimates were regenerated and analysis was done (4/19/2021)


## Challenge overview
72 changes: 68 additions & 4 deletions physical_property/logD/README.md
@@ -1,10 +1,74 @@
## SAMPL7 log *D* Predictions

Placeholder
Ranked SAMPL7 pK<sub>a</sub> and log *P* predictions were combined to estimate log *D*<sub>7.4</sub>. The Mathematica notebook used to do this analysis is available in the manifest.
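
As a rough illustration of how a log *P* and a macroscopic pK<sub>a</sub> can be combined, the sketch below assumes a monoprotic acid whose ionized form does not partition into the organic phase. The function name and the input values are hypothetical, and this is not a transcription of `calc_logD.nb`, which remains the reference for what was actually computed.

```python
import math


def logd_monoprotic_acid(logp, pka, ph=7.4):
    """Estimate log D for a monoprotic acid HA <-> A- + H+.

    Assumption: only the neutral species enters the organic phase, so
    D = P * (neutral fraction) = P / (1 + 10**(pH - pKa)).
    """
    return logp - math.log10(1.0 + 10.0 ** (ph - pka))


# Hypothetical inputs, not taken from any SAMPL7 submission.
# At pH == pKa half of the compound is neutral, so log D = log P - log10(2).
print(round(logd_monoprotic_acid(logp=2.0, pka=7.4), 3))  # 1.699
```

When the pK<sub>a</sub> is far above the pH the correction term vanishes and log *D* approaches log *P*; a base or a multiprotic compound would need a different expression.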

General analysis of log *D* predictions includes calculated vs. predicted log *D* correlation plots and 6 performance statistics (RMSE, MAE, ME, R^2, linear regression slope (m), and error slope (ES)) for all the submissions.
95% percentile bootstrap confidence intervals are reported for all statistics.
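
The following is a minimal sketch of three of these statistics (RMSE, MAE, ME) together with a percentile bootstrap confidence interval obtained by resampling molecules with replacement. All function names and data values are hypothetical, the number of bootstrap samples is an arbitrary choice, and `logD_analysis.py` may compute these quantities differently; R^2, the regression slope, Kendall's Tau, and the error slope are left out for brevity.

```python
import math
import random


def rmse(pred, expt):
    """Root-mean-square error of predictions against experiment."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, expt)) / len(pred))


def mae(pred, expt):
    """Mean absolute error."""
    return sum(abs(p - e) for p, e in zip(pred, expt)) / len(pred)


def mean_error(pred, expt):
    """Mean signed error (predicted minus experimental)."""
    return sum(p - e for p, e in zip(pred, expt)) / len(pred)


def bootstrap_ci(stat, pred, expt, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval: resample molecules with replacement,
    recompute the statistic, and take the lower and upper percentiles."""
    rng = random.Random(seed)
    n = len(pred)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(stat([pred[i] for i in idx], [expt[i] for i in idx]))
    values.sort()
    lower = values[int((1.0 - level) / 2.0 * (n_boot - 1))]
    upper = values[int((1.0 + level) / 2.0 * (n_boot - 1))]
    return lower, upper


# Hypothetical log D values for five molecules, for illustration only.
pred = [1.2, 0.5, 2.8, -0.3, 1.9]
expt = [1.0, 0.9, 2.5, 0.1, 2.4]
print("RMSE", rmse(pred, expt), bootstrap_ci(rmse, pred, expt))
print("MAE ", mae(pred, expt), bootstrap_ci(mae, pred, expt))
print("ME  ", mean_error(pred, expt), bootstrap_ci(mean_error, pred, expt))
```

Fixing the seed keeps the interval reproducible between runs; with only a handful of molecules the interval is wide, which is the point of reporting it alongside the point estimate.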

Molecular statistics analysis was performed to indicate which molecules were more difficult to predict accurately across submitted methods. Error statistics (MAE and RMSE) were calculated for each molecule averaging across all methods or for all methods within a method category.
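
A per-molecule version of the error statistics can be sketched as below: errors are grouped by molecule and then averaged across methods. The data layout (nested dictionaries), the function name, and the molecule and method labels are assumptions made for this example and do not reflect the file formats read by `logD_analysis2.py`.

```python
import math
from collections import defaultdict


def molecular_error_stats(predictions, experimental):
    """Return {molecule_id: (MAE, RMSE)} computed across methods.

    predictions:  {method_name: {molecule_id: predicted log D}}
    experimental: {molecule_id: experimental log D}
    """
    errors = defaultdict(list)
    for preds in predictions.values():
        for mol_id, value in preds.items():
            if mol_id in experimental:
                errors[mol_id].append(value - experimental[mol_id])

    stats = {}
    for mol_id, errs in errors.items():
        mol_mae = sum(abs(e) for e in errs) / len(errs)
        mol_rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
        stats[mol_id] = (mol_mae, mol_rmse)
    return stats


# Hypothetical molecules, methods, and values, for illustration only.
experimental = {"mol_A": 2.0, "mol_B": 1.0}
predictions = {
    "method_1": {"mol_A": 2.5, "mol_B": 0.0},
    "method_2": {"mol_A": 1.5, "mol_B": 1.0},
}
for mol_id, (mol_mae, mol_rmse) in sorted(molecular_error_stats(predictions, experimental).items()):
    print(mol_id, round(mol_mae, 3), round(mol_rmse, 3))
```

Restricting `predictions` to the methods of one category before calling the function gives the per-category variant of the same statistics.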

## Manifest
- [`analysis/`](analysis/) - Contains analysis of log *D*<sub>7.4</sub> predictions generated from log *P* and pK<sub>a</sub> predictions.
- [`calc_logD.nb`](calc_logD.nb) - Wolfram Mathematica `.nb` file that calculates and exports SAMPL7 distribution coefficients log *D*<sub>7.4</sub> for participants that had submitted a ranked log *P* and a ranked pK<sub>a</sub> submission. The notebook gathers the predicted macroscopic acidity constants and the partition coefficients from [`pKa_submission_collection.csv`](../pKa/analysis/macrostate_analysis/analysis_outputs_ranked_submissions/pKa_submission_collection.csv) and [`logP_submission_collection.csv`](../logP/analysis/analysis_outputs_ranked_submissions/logP_submission_collection.csv), respectively. The log *D*<sub>7.4</sub> is then calculated under the assumption that the ionic species can not enter the organic phase [1]. Because the acidity constants listed in [`pKa_submission_collection.csv`](../pKa/analysis/macrostate_analysis/analysis_outputs_ranked_submissions/pKa_submission_collection.csv) do not contain information about the charge states of the protonated and deprotonated species, the consensus of models that had submitted macroscopic pK<sub>a</sub> values including the charge states was used to determine that eq. 4 should be used for all compounds.
- [`calculate_logD/`](./calculate_logD/)
- `calc_logD.nb` - Wolfram Mathematica `.nb` file that calculates and exports SAMPL7 distribution coefficients log *D*<sub>7.4</sub> for participants that had submitted both a ranked log *P* and a ranked pK<sub>a</sub> submission. The notebook gathers the predicted macroscopic acidity constants and the partition coefficients from [`pKa_submission_collection.csv`](../pKa/analysis/macrostate_analysis/analysis_outputs_ranked_submissions/pKa_submission_collection.csv) and [`logP_submission_collection.csv`](../logP/analysis/analysis_outputs_ranked_submissions/logP_submission_collection.csv), respectively. The log *D*<sub>7.4</sub> is then calculated under the assumption that the ionic species cannot enter the organic phase [1]. Because the acidity constants listed in [`pKa_submission_collection.csv`](../pKa/analysis/macrostate_analysis/analysis_outputs_ranked_submissions/pKa_submission_collection.csv) do not contain information about the charge states of the protonated and deprotonated species, the consensus of models that had submitted macroscopic pK<sub>a</sub> values including the charge states was used to determine that eq. 4 should be used for all compounds. Notebook created by Nicolas Tielker.
- `logD_submission_collection.csv` - Contains log *D*<sub>7.4</sub> predictions generated from log *P* and pK<sub>a</sub> predictions.
- `logD_predictions/` - Contains SAMPL-style submission files created from the log *D* data found in `logD_submission_collection.csv`. One reference method and one null method were added to this folder for comparison with other methods in the general SAMPL analysis. These submission-style files were used as input to the general SAMPL analysis script (`logD_analysis.py`), and the output can be found in `analysis_outputs_all_submissions/` and `analysis_outputs_ranked_submissions/`.
- [`logD_analysis.py`](logD_analysis.py) - Python script that parses submissions and performs the analysis. It provides two separate treatments: ranked blind predictions alone (output directory: [`analysis_outputs_ranked_submissions/`](analysis_outputs_ranked_submissions/)) and ranked and reference calculations together (output directory: [`analysis_outputs_all_submissions/`](analysis_outputs_all_submissions/)). Reference calculations are provided as comparison methods.
- [`logD_analysis2.py`](logD_analysis2.py) - Python script that performs the analysis of molecular statistics (error statistics, MAE and RMSE, calculated across methods for each molecule).
- [`logD_experimental_values.csv`](logD_experimental_values.csv) - A CSV (`.csv`) table of potentiometric and shake-flask log *D* measurements of the 22 SAMPL molecules.
- [`analysis_outputs_ranked_submissions/`](analysis_outputs_ranked_submissions/) - This directory contains analysis outputs of ranked submissions only.
- `error_for_each_logD.pdf` - Violin plots that show the error distribution of predictions for each experimental log *D*.
- `logDCorrelationPlots/` - This directory contains plots of predicted vs. experimental log *D* values with a linear regression line (blue) for each method. Files are named according to the submitted method name of each submission, which can be found in `statistics_table.csv`. In correlation plots, the dashed black line has a slope of 1. Dark and light green shaded areas indicate ±0.5 and ±1.0 log *D* unit error regions, respectively.
- `logDCorrelationPlotsWithSEM/` - This directory contains plots similar to those in `logDCorrelationPlots/`, with error bars added for the Standard Error of the Mean (SEM) of experimental and predicted values for submissions that reported these values. Where horizontal error bars are not visible, the experimental log *D* SEM values are either too small to see or were not collected.
- `AbsoluteErrorPlots/` - This directory contains a bar plot for each method showing the absolute error of each log *D* prediction compared to the experimental value.
- `StatisticsTables/` - This directory contains machine-readable copies of the Statistics Table, bootstrap distributions of performance statistics, and overall performance comparison plots based on RMSE and MAE values.
- `statistics.csv` - A table of performance statistics (RMSE, MAE, ME, R^2, linear regression slope (m), Kendall's Tau, and error slope (ES)) for all the submissions.
- `RMSE_vs_method_plot.pdf`
- `RMSE_vs_method_plot_colored_by_method_category.pdf`
- `RMSE_vs_method_plot_for_Physical_MM_category.pdf`
- `RMSE_vs_method_plot_for_Physical_QM_category.pdf`
- `RMSE_vs_method_plot_for_Empirical_category.pdf`
- `RMSE_vs_method_plot_for_Mixed_category.pdf`
- `RMSE_vs_method_plot_physical_methoods_colored_by_method_category.pdf`
- `MAE_vs_method_plot.pdf`
- `MAE_vs_method_plot_colored_by_method_category.pdf`
- `MAE_vs_method_plot_for_Physical_MM_category.pdf`
- `MAE_vs_method_plot_for_Physical_QM_category.pdf`
- `MAE_vs_method_plot_for_Empirical_category.pdf`
- `MAE_vs_method_plot_for_Mixed_category.pdf`
- `kendalls_tau_vs_method_plot.pdf`
- `MAE_vs_method_plot_physical_methoods_colored_by_method_category.pdf`
- `kendalls_tau_vs_method_plot_colored_by_method_category.pdf`
- `kendalls_tau_vs_method_plot_for_Physical_MM_category.pdf`
- `kendalls_tau_vs_method_plot_for_Physical_QM_category.pdf`
- `kendalls_tau_vs_method_plot_for_Empirical_category.pdf`
- `kendalls_tau_vs_method_plot_for_Mixed_category.pdf`
- `kendall_tau_vs_method_plot_physical_methoods_colored_by_method_category.pdf`
- `Rsquared_vs_method_plot.pdf`
- `Rsquared_vs_method_plot_colored_by_method_category.pdf`
- `Rsquared_vs_method_plot_colored_by_type.pdf`
- `Rsquared_vs_method_plot_for_Empirical_category.pdf`
- `Rsquared_vs_method_plot_for_Mixed_category.pdf`
- `Rsquared_vs_method_plot_for_Physical_MM_category.pdf`
- `Rsquared_vs_method_plot_for_Physical_QM_category.pdf`
- `Rsquared_vs_method_plot_physical_methoods_colored_by_method_category.pdf`
- `statistics_bootstrap_distributions.pdf` - Violin plots showing bootstrap distributions of performance statistics of each method. Each method is labelled according to the method name of the submission.

- `QQPlots/` - Quantile-Quantile plots for the analysis of model uncertainty predictions.
- `MolecularStatisticsTables/` - This directory contains tables and barplots of molecular statistics analysis (error statistics, MAE and RMSE, calculated across methods for each molecule).
- `MAE_vs_molecule_ID_plot.pdf` - Barplot of MAE calculated for each molecule, averaged over all prediction methods.
- `RMSE_vs_molecule_ID_plot.pdf` - Barplot of RMSE calculated for each molecule, averaged over all prediction methods.
- `molecular_error_statistics.csv` - MAE and RMSE statistics calculated for each molecule averaged over all prediction methods. 95% confidence intervals were calculated via bootstrapping (10000 samples).
- `molecular_MAE_comparison_between_method_categories.pdf` - Barplot of MAE calculated for each method category for each molecule averaging over all predictions in that method category. The colors of the bars indicate method categories.
- `molecular_error_distribution_ridge_plot_all_methods.pdf` - Error distribution of each molecule, based on predictions from all ranked methods.
- `molecular_error_distribution_ridge_plot_well_performing_methods.pdf` - Error distribution of each molecule, based only on predictions from methods determined to be consistently well-performing.
- `Empirical/` - This directory contains tables and barplots of molecular statistics analysis calculated only for methods in the Empirical method category.
- `Physical_MM/` - This directory contains tables and barplots of molecular statistics analysis calculated only for methods in the Physical MM method category.
- `Physical_QM/` - This directory contains tables and barplots of molecular statistics analysis calculated only for methods in the Physical QM method category.

- [`analysis_outputs_all_submissions/`](analysis_outputs_all_submissions/) - Duplicates the [`analysis_outputs_ranked_submissions/`](analysis_outputs_ranked_submissions/) directory, but also includes reference calculations. It contains these additional plots:
- `StatisticsTables/MAE_vs_method_plot_colored_by_type.pdf`: Barplot showing overall performance by MAE, with reference calculations colored differently.
- `StatisticsTables/RMSE_vs_method_plot_colored_by_type.pdf`: Barplot showing overall performance by RMSE, with reference calculations colored differently.
- [`analysis_different_pKa_logP_combos`](analysis_different_pKa_logP_combos) - Contains similar analysis to `analysis_outputs_all_submissions/` except it includes some additional pK<sub>a</sub> and log *P* combinations (for log *D* estimation).

## References
[1] Bannan, Caitlin C., Kalistyn H. Burley, Michael Chiu, Michael R. Shirts, Michael K. Gilson, and David L. Mobley. “Blind Prediction of Cyclohexane–water Distribution Coefficients from the SAMPL5 Challenge.” Journal of Computer-Aided Molecular Design 30, no. 11 (November 2016): 927–44.
@@ -0,0 +1,42 @@
## What's here
- [`analysis/`](analysis/) - Analysis of log *D* predictions. Analysis is similar to that found in `../analysis_outputs_all_submissions/` except it includes some additional pK<sub>a</sub> and log *P* combinations (for log *D* estimation). Method name *logP_experimental + EC_RISM* combines the experimental log *P* with the top performing pK<sub>a</sub> method (based on RMSE), method *logP_experimental + pKa_experimental* combines the experimental log *P* and pK<sub>a</sub> value, method *TFE MLR + EC_RISM* combines the best performing (based on RMSE) log *P* and pK<sub>a</sub> methods, method *TFE MLR + pKa_experimental* combines the best performing (based on RMSE) log *P* method with the experimental pK<sub>a</sub>, method *logP_experimental + DFT_M05-2X_SMD* combines the experimental log *P* with an average performing pK<sub>a</sub> method, method *NES-1 (GAFF2/OPC3) B + pKa_experimental* combines a log *P* method with average performance with the experimental pK<sub>a</sub>.
- `error_for_each_logD.pdf` - Violin plots that show the error distribution of predictions for each experimental log *D*.
- `logDCorrelationPlots/` - This directory contains plots of predicted vs. experimental log *D* values with a linear regression line (blue) for each method. Files are named according to the submitted method name of each submission, which can be found in `statistics_table.csv`. In correlation plots, the dashed black line has a slope of 1. Dark and light green shaded areas indicate ±0.5 and ±1.0 log *D* unit error regions, respectively.
- `logDCorrelationPlotsWithSEM/` - This directory contains plots similar to those in `logDCorrelationPlots/`, with error bars added for the Standard Error of the Mean (SEM) of experimental and predicted values for submissions that reported these values. Where horizontal error bars are not visible, the experimental log *D* SEM values are either too small to see or were not collected.
- `AbsoluteErrorPlots/` - This directory contains a bar plot for each method showing the absolute error of each log *D* prediction compared to the experimental value.
- `StatisticsTables/` - This directory contains machine-readable copies of the Statistics Table, bootstrap distributions of performance statistics, and overall performance comparison plots based on RMSE and MAE values.
- `statistics.csv` - A table of performance statistics (RMSE, MAE, ME, R^2, linear regression slope (m), Kendall's Tau, and error slope (ES)) for all the submissions.
- `RMSE_vs_method_plot.pdf`
- `RMSE_vs_method_plot_colored_by_method_category.pdf`
- `RMSE_vs_method_plot_colored_by_type.pdf`
- `MAE_vs_method_plot.pdf`
- `MAE_vs_method_plot_colored_by_method_category.pdf`
- `MAE_vs_method_plot_colored_by_type.pdf`
- `kendalls_tau_vs_method_plot.pdf`
- `kendalls_tau_vs_method_plot_colored_by_method_category.pdf`
- `kendalls_tau_vs_method_plot_colored_by_type.pdf`
- `Rsquared_vs_method_plot.pdf`
- `Rsquared_vs_method_plot_colored_by_method_category.pdf`
- `Rsquared_vs_method_plot_colored_by_type.pdf`
- `QQPlots/` - Quantile-Quantile plots for the analysis of model uncertainty predictions.
- `MolecularStatisticsTables/` - This directory contains tables and barplots of molecular statistics analysis (error statistics, MAE and RMSE, calculated across methods for each molecule).
- `MAE_vs_molecule_ID_plot.pdf` - Barplot of MAE calculated for each molecule, averaged over all prediction methods.
- `RMSE_vs_molecule_ID_plot.pdf` - Barplot of RMSE calculated for each molecule, averaged over all prediction methods.
- `molecular_error_statistics.csv` - MAE and RMSE statistics calculated for each molecule averaged over all prediction methods. 95% confidence intervals were calculated via bootstrapping (10000 samples).
- `Empirical/` - This directory contains tables and barplots of molecular statistics analysis calculated only for methods in the empirical method category.
- `Empirical_Experimental_pKa/` - This directory contains tables and barplots of molecular statistics analysis calculated only for methods combining an empirical method and experimental pK<sub>a</sub>.
- `Empirical_QM/` - This directory contains tables and barplots of molecular statistics analysis calculated only for methods combining an empirical and a QM method.
- `Experimental_logP_QM/` - This directory contains tables and barplots of molecular statistics analysis calculated only for methods combining experimental log *P* and QM predictions.
- `Experimental_only/`
- `Physical_MM_Experimental_pKa/` - This directory contains tables and barplots of molecular statistics analysis calculated only for methods combining MM methods and experimental pK<sub>a</sub>.
- `Physical_MM_QM_LEC/` - This directory contains tables and barplots of molecular statistics analysis calculated only for methods combining MM and QM+LEC.
- `Physical_QM/` - This directory contains tables and barplots of molecular statistics analysis calculated only for methods in the physical QM category.
- [`submission_collection_files/`](logD_submission_collection.csv) - Contains analysis of log *D*<sub>7.4</sub> predictions generated from log *P* and pK<sub>a</sub> predictions using the `calc_logD.nb` notebook found in `../calculate_logD`.
- `experimental_pKa_and_logP_combined.csv` - log *D* predictions generated from the experimental log *P* and pK<sub>a</sub> values.
- `best_pKa_and_logP_method_combined.csv` - log *D* predictions generated from the top-performing log *P* and pK<sub>a</sub> prediction methods.
- `experimental_logP_and_participant_pKa_predictions_combined.csv` - log *D* predictions generated from experimental log *P* values and participant pK<sub>a</sub> predictions.
- `experimental_pKa_and_participant_logP_predictions_combined.csv` - log *D* predictions generated from experimental pK<sub>a</sub> values and participant log *P* predictions.
- [`make_input.ipynb`](make_input.ipynb) - Example notebook that takes the log *D* data in `submission_collection_files/` and converts it to the SAMPL-style submission format used as input for analysis.
- [`input_files/`](input_files/) - Contains SAMPL-style submission files that were created from the log *D* data generated by [`calc_logD.nb`](calc_logD.nb). Also contains files from `../calculate_logD/logD_predictions/`. These were used as input for the general SAMPL analysis.
- [`user-map2.csv`](user-map2.csv) - Manually created user map of all log *D* estimate files. Used as input for the general SAMPL analysis scripts.
- [`experimental_value_files/`](experimental_value_files/) - Contains files that have log *P* and pK<sub>a</sub> values. Used as input for [`calc_logD.nb`](calc_logD.nb).
Binary file not shown.