OpenMM SPICE 2.0 whitepaper
John Chodera edited this page Jul 11, 2023 · 17 revisions
The purpose of this document is to collect information about the datasets we intend to include in the OpenMM SPICE 2.0 dataset.
- Restrained minimization with GFN2-xTB to relax intramolecular geometries
- 10 diverse conformers are generated by RDKit
- Each conformer is used to generate 10 conformations x 10 ps/conformation at 300 K using Langevin dynamics with OpenFF 2.1.0
- A couple of steps of minimization with GFN2-xTB to clean up problematic intramolecular geometries
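The thermalization protocol above implies a fixed sampling budget per molecule. The sketch below only does that bookkeeping. It reads "10 conformations x 10 ps/conformation" as ten snapshots spaced 10 ps apart, which is an interpretation, and the 1 fs timestep is an assumption, since the text does not state one. The `SamplingPlan` name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingPlan:
    n_conformers: int = 10             # RDKit conformers per molecule (from the text)
    snapshots_per_conformer: int = 10  # MD conformations per conformer (from the text)
    ps_between_snapshots: float = 10.0 # simulation time per conformation (from the text)
    timestep_fs: float = 1.0           # ASSUMPTION: timestep is not given in the text

    @property
    def steps_between_snapshots(self) -> int:
        # 1 ps = 1000 fs
        return round(self.ps_between_snapshots * 1000.0 / self.timestep_fs)

    @property
    def total_snapshots(self) -> int:
        return self.n_conformers * self.snapshots_per_conformer

    @property
    def total_md_ps(self) -> float:
        return self.total_snapshots * self.ps_between_snapshots


plan = SamplingPlan()
print(plan.steps_between_snapshots, plan.total_snapshots, plan.total_md_ps)
```

Under these assumptions each molecule contributes 100 conformations from 1 ns of dynamics in total.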
- SPICE 1.0: ωB97M-D3BJ/def2-TZVPPD
- OpenFF: B3LYP-D3BJ/DZVP
- Pretraining: GFN2-xTB
- Very high level for some subset: dlpno-mp2
- Potential improved basis set for some subset: ωB97M-D3BJ/def2-QZVPPD
- Rationale: Error goes from ~3.5 kcal/mol to <1 kcal/mol, but much more expensive
- Need higher levels of theory for open-shell metals: ?
- CBS: Could extrapolate CCSD(T)*/CBS via an approach similar to the one used in the ANI-1ccx paper
- OrbNet-like features: Computing these during the QM calculations costs nothing extra, whereas generating them externally would be expensive. They do cost storage space, however. HF with localization, as in OrbNet?
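To make the CBS item concrete, here is a minimal sketch of a two-point complete-basis-set extrapolation. It uses the common inverse-cubic form for correlation energies, E(X) = E_CBS + A/X^3, where X is the cardinal number of the basis set (3 for triple-zeta, 4 for quadruple-zeta). This is a generic textbook scheme chosen for illustration; it does not reproduce the composite protocol of the ANI-1ccx paper, and the function name is hypothetical.

```python
def cbs_two_point(e_small: float, x_small: int, e_large: float, x_large: int) -> float:
    """Extrapolate a correlation energy to the CBS limit from two basis sets.

    Assumes E(X) = E_CBS + A / X**3, which gives
    E_CBS = (X_l**3 * E_l - X_s**3 * E_s) / (X_l**3 - X_s**3).
    """
    if x_large <= x_small:
        raise ValueError("x_large must be greater than x_small")
    xs3 = x_small ** 3
    xl3 = x_large ** 3
    return (xl3 * e_large - xs3 * e_small) / (xl3 - xs3)


# Synthetic check: energies that follow the assumed form exactly
# (the numbers are made up, not results from any calculation).
E_CBS_TRUE, A = -1.0, 0.5
e_tz = E_CBS_TRUE + A / 3 ** 3
e_qz = E_CBS_TRUE + A / 4 ** 3
e_cbs = cbs_two_point(e_tz, 3, e_qz, 4)
```

The extrapolation applies to the correlation energy only; the Hartree-Fock component is usually treated separately.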
- An `OptimizationDataset` using 5 steps of gradient descent, which progresses most of the way to the minimum
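As a toy illustration of why a handful of gradient descent steps can recover most of the relaxation, the sketch below runs 5 steps on a one-dimensional harmonic "bond". The potential, force constant, and step size are all hypothetical, and this is not how QCFractal carries out an optimization; it only shows the idea of a truncated descent.

```python
def gradient_descent(x0, grad, step_size, n_steps=5):
    """Run a fixed number of steepest-descent steps and return every iterate."""
    x = list(x0)
    trajectory = [tuple(x)]
    for _ in range(n_steps):
        g = grad(x)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
        trajectory.append(tuple(x))
    return trajectory


# Hypothetical harmonic potential E(r) = 0.5 * K * (r - R0)**2
K, R0 = 1.0, 1.0


def energy(x):
    return 0.5 * K * (x[0] - R0) ** 2


def grad(x):
    return [K * (x[0] - R0)]


traj = gradient_descent([1.5], grad, step_size=0.5)
remaining = energy(traj[-1]) / energy(traj[0])  # fraction of excess energy left
```

With these toy parameters the displacement from the minimum halves at every step, so after 5 steps well under 1% of the initial excess energy remains. Real molecular surfaces are anharmonic and multidimensional, so the actual progress per step will differ.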
We propose to collect the following information for each dataset (to be pulled into each dataset's `README.md` file later):
- Dataset Name:
- Location:
- Rationale:
- Molecular composition:
- Structure generation strategy:
- QCFractal dataset type:
- Level(s) of theory used:
- Properties computed:
- Priority:
- Owner:
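One way to keep these fields consistent across datasets is to hold them in a small record and render the `README.md` list from it. The sketch below is a hypothetical helper (the `DatasetCard` class and `to_readme` function are not part of any existing tooling); the example values are taken from the water cluster entry on this page.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetCard:
    dataset_name: str
    location: str = ""
    rationale: str = ""
    molecular_composition: str = ""
    structure_generation_strategy: str = ""
    qcfractal_dataset_type: str = ""
    levels_of_theory: str = ""
    properties_computed: str = ""
    priority: str = ""
    owner: str = ""


# Field name -> label used in the template list
LABELS = {
    "dataset_name": "Dataset Name",
    "location": "Location",
    "rationale": "Rationale",
    "molecular_composition": "Molecular composition",
    "structure_generation_strategy": "Structure generation strategy",
    "qcfractal_dataset_type": "QCFractal dataset type",
    "levels_of_theory": "Level(s) of theory used",
    "properties_computed": "Properties computed",
    "priority": "Priority",
    "owner": "Owner",
}


def to_readme(card: DatasetCard) -> str:
    """Render the card as a markdown bullet list; empty fields stay blank."""
    lines = []
    for f in fields(card):
        value = getattr(card, f.name)
        lines.append(f"- {LABELS[f.name]}: {value}".rstrip())
    return "\n".join(lines) + "\n"


card = DatasetCard(
    dataset_name="300K 1atm bulk water clusters",
    location="https://github.com/openmm/spice-dataset/tree/main/water",
    molecular_composition="Water molecules only",
    qcfractal_dataset_type="SinglePointDataset",
    properties_computed="energy, gradient",
    priority="High",
    owner="Peter Eastman (@peastman)",
)
readme = to_readme(card)
```

Fields that are still undecided render as an empty bullet, which matches how the entries on this page are currently filled in.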
- Dataset Name: 300K 1atm bulk water clusters
- Location: https://github.com/openmm/spice-dataset/tree/main/water
- Rationale: To capture interactions representative of bulk water, we need samples of compact water clusters carved out of snapshots of bulk water at ambient temperature and pressure.
- Molecular composition: Water molecules only
- Structure generation strategy: This collection contains 1000 conformations for a cluster of 30 water molecules extracted from bulk water simulations using the `amoeba2018.xml` force field at 300 K and 1 atm.
- QCFractal dataset type: SinglePointDataset
- Level(s) of theory to be used:
- Properties to be computed: energy, gradient
- Priority: High
- Owner: Peter Eastman (@peastman)
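The entry above does not say how the 30-molecule clusters are carved from a bulk snapshot. A minimal sketch of one plausible approach, assuming a cubic periodic box and selecting the waters whose oxygen atoms lie nearest to a chosen central water, is shown below. The function names, the box size, and the random test coordinates are all hypothetical.

```python
import math
import random


def pbc_distance(a, b, box_length):
    """Distance between points a and b in a cubic periodic box (minimum image)."""
    total = 0.0
    for ai, bi in zip(a, b):
        d = ai - bi
        d -= box_length * round(d / box_length)  # wrap into [-L/2, L/2]
        total += d * d
    return math.sqrt(total)


def carve_cluster(oxygens, box_length, n_molecules=30, center_index=0):
    """Return indices of the n_molecules waters nearest to a central water."""
    if n_molecules > len(oxygens):
        raise ValueError("snapshot has fewer molecules than requested")
    center = oxygens[center_index]
    ranked = sorted(
        range(len(oxygens)),
        key=lambda i: pbc_distance(oxygens[i], center, box_length),
    )
    return ranked[:n_molecules]


# Hypothetical snapshot: 216 random "oxygen" positions in an 18.6 Å cubic box.
# Random points stand in for real simulation coordinates here.
rng = random.Random(0)
BOX = 18.6
snapshot = [tuple(rng.uniform(0.0, BOX) for _ in range(3)) for _ in range(216)]
cluster = carve_cluster(snapshot, BOX)
```

Selecting by distance to a central molecule yields a roughly spherical, compact cluster. In a real workflow the selected molecules would also need to be unwrapped so that the cluster is not split across periodic boundaries.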
- Dataset Name: PDB Chemical Components Dictionary (CCD) simple subset
- Location:
- Rationale: To ensure that much of the Protein Data Bank (PDB) can be simulated, this dataset contains a facile subset of the CCD restricted to common elements, with no metals
- Molecular composition: PDB CCD subset with common elements and no metals
- Structure generation strategy: Thermalization at 300 K
- QCFractal dataset type: SinglePointDataset
- Level(s) of theory to be used: SPICE, OpenFF, Pretraining
- Properties to be computed: energy, gradient
- Priority:
- Owner:
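Dataset entries refer to levels of theory by short labels such as "SPICE, OpenFF, Pretraining". A small lookup table makes those labels unambiguous when preparing submissions. The sketch below uses the method and basis set pairs listed earlier on this page; treating "SPICE" as shorthand for "SPICE 1.0" is an assumption, and `resolve_levels` is a hypothetical helper.

```python
# Method/basis pairs as listed on this page.
LEVELS_OF_THEORY = {
    "SPICE 1.0": {"method": "ωB97M-D3BJ", "basis": "def2-TZVPPD"},
    "OpenFF": {"method": "B3LYP-D3BJ", "basis": "DZVP"},
    # No separate basis set is listed for GFN2-xTB in the text.
    "Pretraining": {"method": "GFN2-xTB", "basis": None},
}

# ASSUMPTION: "SPICE" in dataset entries means the SPICE 1.0 level of theory.
ALIASES = {"SPICE": "SPICE 1.0"}


def resolve_levels(field):
    """Turn a comma-separated label string into a list of method/basis specs."""
    specs = []
    for label in field.split(","):
        label = label.strip()
        if not label:
            continue
        label = ALIASES.get(label, label)
        if label not in LEVELS_OF_THEORY:
            raise KeyError(f"unknown level-of-theory label: {label!r}")
        specs.append({"label": label, **LEVELS_OF_THEORY[label]})
    return specs


specs = resolve_levels("SPICE, OpenFF, Pretraining")
```

Raising on unknown labels is deliberate: free-text entries such as "(Likely needs higher levels of theory)" should be resolved by a person before anything is submitted.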
- Dataset Name: PDB Chemical Components Dictionary (CCD) containing metals and heavy elements
- Location:
- Rationale: To ensure that much of the Protein Data Bank (PDB) can be simulated, this dataset contains the subset of the CCD that includes metals and heavy elements
- Molecular composition: PDB CCD subset with metals and heavier elements
- Structure generation strategy: Thermalization at 300 K
- QCFractal dataset type: SinglePointDataset
- Level(s) of theory to be used: (Likely needs higher levels of theory)
- Properties to be computed: energy, gradient
- Priority:
- Owner:
- Dataset Name: Enamine REALSpace building blocks
- Location:
- Rationale: To describe the Enamine REALSpace virtual synthetic library, this dataset includes a diverse subset of Enamine building blocks
- Molecular composition: Enamine building blocks
- Structure generation strategy: Thermalization at 300 K
- QCFractal dataset type: SinglePointDataset
- Level(s) of theory to be used: SPICE, OpenFF, Pretraining
- Properties to be computed: energy, gradient
- Priority:
- Owner:
- Dataset Name: Enamine REALSpace coupling chemistries
- Location:
- Rationale: To describe the Enamine REALSpace virtual synthetic library, this dataset includes a diverse subset of coupling chemistries used to link Enamine building blocks within Enamine REALSpace
- Molecular composition: Enamine REALSpace coupling chemistries
- Structure generation strategy: Thermalization at 300 K
- QCFractal dataset type: SinglePointDataset
- Level(s) of theory to be used: SPICE, OpenFF, Pretraining
- Properties to be computed: energy, gradient
- Priority:
- Owner:
- Dataset Name: Organic ion interactions
- Location:
- Rationale: To capture interactions between organic ions, both in direct contact and separated by solvent
- Molecular composition: Fragments from pairs of interacting residues from the PDB where at least one fragment is an organic ion, interacting directly or via solvent bridges.
- Structure generation strategy: Geometries extracted directly from the PDB
- QCFractal dataset type:
- Level(s) of theory to be used: SPICE 1.0, OpenFF
- Properties to be computed: energy, gradient
- Priority:
- Owner:
- Dataset Name: Metal-organic interactions from the PDB
- Location:
- Rationale: Interactions of metal centers with chelating sidechains and small molecules are important for modeling metalloenzymes.
- Molecular composition: This set includes metal centers, coordinating protein sidechain fragments, and coordinated small molecules from the PDB.
- Structure generation strategy: Geometries extracted directly from the PDB
- QCFractal dataset type:
- Level(s) of theory to be used: (Likely needs to be a better level of theory than SPICE 1.0)
- Properties to be computed: energy, gradient
- Priority:
- Owner: