Merge pull request #397 from openforcefield/mlpepper_iodines
Mlpepper iodines
amcisaac authored Oct 29, 2024
2 parents 163aedd + d0d5ffd commit 919f255
Showing 13 changed files with 57,032 additions and 3 deletions.
3 changes: 3 additions & 0 deletions .gitattributes
@@ -88,3 +88,6 @@
*.zst filter=lfs diff=lfs merge=lfs -text
*.bz filter=lfs diff=lfs merge=lfs -text
*bz2 filter=lfs diff=lfs merge=lfs -text
/mnt/storage/nobackup/nca121/qca-dataset-submission/submissions/2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0/esp_50k_I_singlepoint_dataset.json.bz2 filter=lfs diff=lfs merge=lfs -text
/mnt/storage/nobackup/nca121/qca-dataset-submission/submissions/2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0/dataset.pdf filter=lfs diff=lfs merge=lfs -text
/mnt/storage/nobackup/nca121/qca-dataset-submission/submissions/2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0/iodine_filtered.json.bz2 filter=lfs diff=lfs merge=lfs -text
4 changes: 1 addition & 3 deletions README.md
@@ -231,7 +231,7 @@ These are currently used to compute properties of a minimum energy conformation
|`OpenFF Sulfur Hessian Training Coverage Supplement v1.0` | [2024-09-18-OpenFF-Sulfur-Hessian-Training-Coverage-Supplement-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-09-18-OpenFF-Sulfur-Hessian-Training-Coverage-Supplement-v1.0) | Additional Hessian training data for Sage sulfur and phosphorus parameters (from ['OpenFF Sulfur Optimization Training Coverage Supplement v1.0'](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-09-11-OpenFF-Sulfur-Optimization-Training-Coverage-Supplement-v1.0)) | O, S, C, Cl, P, N, F, Br, H | |
| `OpenFF Aniline Para Hessian v1.0` | [2024-10-07-OpenFF-Aniline-Para-Hessian-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-07-OpenFF-Aniline-Para-Hessian-v1.0) | Hessian single points for the final molecules in the `OpenFF Aniline Para Opt v1.0` [dataset](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2021-04-02-OpenFF-Aniline-Para-Opt-v1.0) | 'O', 'Cl', 'S', 'Br', 'H', 'F', 'N', 'C' ||
|`OpenFF Gen2 Hessian Dataset Protomers v1.0` | [2024-10-07-OpenFF-Gen2-Hessian-Dataset-Protomers-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-07-OpenFF-Gen2-Hessian-Dataset-Protomers-v1.0/) | Hessian single points for the final molecules in the `OpenFF Gen2 Optimization Dataset Protomers v1.0` [dataset](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2021-12-21-OpenFF-Gen2-Optimization-Set-Protomers) | 'H', 'C', 'Cl', 'P', 'F', 'Br', 'O', 'N', 'S'||

| `MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0` | [2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0) | Set of diverse iodine-containing molecules with a number of calculated electrostatic properties. | Br, Cl, S, B, O, Si, C, N, I, P, H, F| |


# Optimization Datasets
@@ -278,7 +278,6 @@ These are currently used to find a minimum energy conformation of a molecule.
| `OpenFF Sulfur Optimization Benchmarking Coverage Supplement v1.0` | [2024-09-18-OpenFF-Sulfur-Optimization-Benchmarking-Coverage-Supplement-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-09-18-OpenFF-Sulfur-Optimization-Benchmarking-Coverage-Supplement-v1.0) | Additional optimization benchmarking data for Sage sulfur and phosphorus parameters | S, P, Cl, C, N, O, H, Br, F | |
| `OpenFF Lipid Optimization Training Supplement v1.0` | [2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-08-OpenFF-Lipid-Optimization-Training-Supplement-v1.0) | Additional optimization training data for Sage from representative LIPID MAPS fragments | I, Br, O, H, P, C, N, Cl, F, S | |


# TorsionDrive Datasets
These are currently used to perform a complete rotation of one or more selected bonds, where optimizations are performed over a discrete set of angles.

@@ -336,7 +335,6 @@ These are currently used to perform a complete rotation of one or more selected bon
| `OpenFF Phosphate Torsion Drives v1.0` | [2024-07-17-OpenFF-Phosphate-Torsion-Drives-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-07-17-OpenFF-Phosphate-Torsion-Drives-v1.0) | Lipid-like phosphate torsions | C, S, N, H, O, P | |
| `OpenFF Alkane Torsion Drives v1.0` | [2024-08-09-OpenFF-Alkane-Torsion-Drives-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-08-09-OpenFF-Alkane-Torsion-Drives-v1.0) | Alka/ene torsion drives | C, H | |


# GridOptimization Datasets
These are currently used to perform a scan of one or more internal coordinates (bond, angle, torsion), where optimizations are performed over a discrete set of values.

@@ -0,0 +1,95 @@
# MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0

## Description

A single point dataset created by combining the [50k ESP set from Simon](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2022-01-16-OpenFF-ESP-Fragment-Conformers-v1.0) and the
[Br-substituted set from Lily](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2023-11-30-OpenFF-multi-Br-ESP-Fragment-Conformers-v1.1-single-point), filtering for molecules that contain Cl or Br and successively replacing those atoms with iodine, plus some additional iodine-containing molecules from [Lexie and Lily's set](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-09-10-OpenFF-Iodine-Fragment-Opt-v1.0).
Five conformations were generated for each fragment and optimised locally using an AIMNET2 model trained to `wb97m-d3`.
This extends the [original MLPepper dataset](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-07-26-MLPepper-RECAP-Optimized-Fragments-v1.0) with iodine-containing molecules.
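
The successive halogen replacement described above can be sketched as a plain text operation on SMILES strings. This is only an illustration under stated assumptions: the function name `iodine_variants` is hypothetical, the reading of "successively" as "one more atom replaced per step" is an assumption, and the actual `create_optimisation.py` script may well use a cheminformatics toolkit rather than string substitution.

```python
import re

# Matches the two-letter halogen symbols Cl and Br as written in SMILES.
HALOGEN = re.compile(r"Cl|Br")


def iodine_variants(smiles: str) -> list[str]:
    """Return SMILES strings with Cl/Br swapped for I, one more atom per step.

    Hypothetical helper: a text-level sketch, not the submission's own code.
    """
    variants = []
    current = smiles
    while True:
        # Replace only the first remaining Cl or Br with I.
        replaced = HALOGEN.sub("I", current, count=1)
        if replaced == current:
            break  # no Cl or Br left to replace
        variants.append(replaced)
        current = replaced
    return variants


variants = iodine_variants("Clc1ccc(Br)cc1")
```

A molecule without Cl or Br yields an empty list, which corresponds to it being filtered out before replacement.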

The aim of the dataset is to provide polarised and gas phase electrostatic properties which can be used to generate ML models
for partial charge prediction. Unlike past datasets, the wavefunction will not be saved to recompute the ESP; instead, we recommend building the ESP
from the MBIS atomic multipoles, which saves a substantial amount of space.
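
As a rough illustration of building the ESP from atomic multipoles, the sketch below sums the charge and dipole contributions of each atom at a single grid point, in atomic units. The function name `esp_from_multipoles` and the input layout are assumptions made for this example, and the expansion is truncated after the dipole term; a production calculation would typically include quadrupole and higher terms as well.

```python
import math


def esp_from_multipoles(point, sites):
    """Electrostatic potential at `point` from atomic charges and dipoles.

    Hypothetical helper, atomic units throughout (bohr, e, e*bohr).
    `sites` is an iterable of (position, charge, dipole) tuples, where
    position and dipole are 3-component sequences.
    Only monopole and dipole terms are included in this sketch.
    """
    total = 0.0
    for position, charge, dipole in sites:
        # Vector from the atom to the evaluation point and its length.
        d = [p - s for p, s in zip(point, position)]
        r = math.sqrt(sum(c * c for c in d))
        # Monopole term: q / r
        total += charge / r
        # Dipole term: (mu . d) / r^3
        total += sum(m * c for m, c in zip(dipole, d)) / r**3
    return total
```

Evaluating this function over a grid of points around the molecule gives an approximate ESP surface without storing the wavefunction.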

An off-equilibrium dataset will also be generated to enable conformation-dependent prediction of charges.

## General Information


* Date: 2024-10-11
* Class: OpenFF SinglePoint Dataset
* Purpose: Electrostatic properties for ML prediction models
* Name: MLPepper RECAP Optimized Fragments v1.0 Add Iodines
* Number of unique molecules: 5733
* Number of filtered molecules: 0
* Number of conformers: 6131
* Number of conformers per molecule (min, mean, max): 1, 1.07, 3
* Mean molecular weight: 278.86
* Max molecular weight: 701.59
* Charges: [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
* Dataset submitter: Josh Horton/ Charlie Adams
* Dataset generator: Josh Horton/ Charlie Adams
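
The mean number of conformers per molecule listed above follows directly from the two counts; a quick check, using only the numbers reported in this list:

```python
n_molecules = 5733   # number of unique molecules
n_conformers = 6131  # number of conformers

# Mean conformers per molecule, as reported in the summary.
mean_conformers = n_conformers / n_molecules
print(f"{mean_conformers:.2f}")  # prints 1.07
```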

## QCSubmit generation pipeline

- `create_optimisation.py`: Used to replace Cl and Br with iodine in molecules from the original MLPepper dataset, and then to create the optimisation dataset.
- `create_singlepoints.py`: Used to create the singlepoints dataset for the optimised iodine sets.
- `create_dataset.py`: Finally, this script combines the resulting datasets into a single point dataset ready for submission.

## QCSubmit Manifest

### Input Files

- `create_optimisation.py`: Script used to make the optimisation dataset for local optimisation.
- `create_singlepoints.py`: Script to create the singlepoints dataset from the optimised geometries.
- `create_dataset.py`: Script to create the singlepoint dataset from the optimization set, removing any connectivity issues.
- `basic_env.yml`: Input file used to create the environment this dataset was built with.
- `conda_export.yml`: File produced by exporting the conda environment.

### Output Files
- `dataset.json.bz2`: The basic dataset ready for submission.
- `dataset.pdf`: A pdf file containing molecule 2D structures.
- `dataset.smi`: SMILES for every molecule in the submission.
- `dataset_mlpepper.smi`: SMILES of the original dataset molecules used to generate the iodine-containing molecules.

### Metadata

* Number of conformers: 6131
* Number of conformers per molecule (min, mean, max): 1, 1.07, 3
* Mean molecular weight: 278.86
* Max molecular weight: 701.59
* Charges: [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
* Elements: {B, Si, I, C, N, Br, O, S, Cl, H, F, P}
* Spec: wb97x-d/def2-tzvpp
    * basis: def2-tzvpp
    * implicit_solvent: None
    * keywords: {'dft_spherical_points': 590, 'dft_radial_points': 99}
    * maxiter: 200
    * method: wb97x-d
    * program: psi4
    * SCF properties:
        * dipole
        * quadrupole
        * lowdin_charges
        * mulliken_charges
        * mbis_charges
        * mayer_indices
        * wiberg_lowdin_indices
        * dipole_polarizabilities
* Spec: wb97x-d/def2-tzvpp/ddx-water
    * basis: def2-tzvpp
    * implicit_solvent: {'ddx_model': 'pcm', 'ddx_radii_scaling': 1.1, 'ddx_radii_set': 'uff', 'ddx_solvent_epsilon': 78.4, 'ddx_solvent': 'water'}
    * keywords: {'dft_spherical_points': 590, 'dft_radial_points': 99}
    * maxiter: 200
    * method: wb97x-d
    * program: psi4
    * SCF properties:
        * dipole
        * quadrupole
        * lowdin_charges
        * mulliken_charges
        * mbis_charges
        * mayer_indices
        * wiberg_lowdin_indices
        * dipole_polarizabilities
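
The two specifications listed above share every setting except the implicit solvent. The sketch below records them as plain Python dictionaries to make that relationship explicit; the dictionary layout and variable names are assumptions for illustration and are not the QCSubmit API.

```python
# SCF properties requested for both specifications (taken from the list above).
SCF_PROPERTIES = [
    "dipole",
    "quadrupole",
    "lowdin_charges",
    "mulliken_charges",
    "mbis_charges",
    "mayer_indices",
    "wiberg_lowdin_indices",
    "dipole_polarizabilities",
]

# Settings common to the gas phase and implicit solvent calculations.
BASE_SPEC = {
    "program": "psi4",
    "method": "wb97x-d",
    "basis": "def2-tzvpp",
    "maxiter": 200,
    "keywords": {"dft_spherical_points": 590, "dft_radial_points": 99},
    "scf_properties": SCF_PROPERTIES,
}

# DDX settings used only by the water specification.
DDX_WATER = {
    "ddx_model": "pcm",
    "ddx_radii_scaling": 1.1,
    "ddx_radii_set": "uff",
    "ddx_solvent_epsilon": 78.4,
    "ddx_solvent": "water",
}

specs = {
    "wb97x-d/def2-tzvpp": {**BASE_SPEC, "implicit_solvent": None},
    "wb97x-d/def2-tzvpp/ddx-water": {**BASE_SPEC, "implicit_solvent": DDX_WATER},
}

# Confirm which settings differ between the two specifications.
gas = specs["wb97x-d/def2-tzvpp"]
water = specs["wb97x-d/def2-tzvpp/ddx-water"]
differing_keys = sorted(key for key in gas if gas[key] != water[key])
```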
@@ -0,0 +1,15 @@
name: qc-submit
channels:
- conda-forge
- defaults
dependencies:
- python=3.11
- pip
- qcportal=0.53
- openff-qcsubmit=0.50.3
- openff-toolkit
- ca-certificates
- certifi
- openssl
- nglview
prefix: /mnt/nfs/home/nca121/mambaforge/envs/qc-submit