OpenMM SPICE 2.0 whitepaper

This document collects information about the component datasets we intend to include in OpenMM SPICE 2.0.

Conformer Generation Protocols

Single snapshots

Restrained minimization with GFN2-xTB to relax intramolecular geometries

Thermalization at 300 K

RDKit generates 10 diverse conformers.
Each conformer seeds Langevin dynamics at 300 K with OpenFF 2.1.0, yielding 10 conformations sampled 10 ps apart.
A couple of steps of GFN2-xTB minimization then clean up problematic intramolecular geometries.
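The sampling schedule above can be written out as simple bookkeeping. The sketch below is illustrative only: the class name `ThermalizationPlan` and its fields are hypothetical, and it assumes "10 conformations x 10 ps/conformation" means ten snapshots saved 10 ps apart along one trajectory per starting conformer. It does not run RDKit, OpenMM, or xTB.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class ThermalizationPlan:
    """Hypothetical container for the thermalization schedule (not a SPICE API)."""

    n_conformers: int = 10             # diverse starting conformers from RDKit
    snapshots_per_conformer: int = 10  # conformations saved per starting conformer
    ps_between_snapshots: float = 10.0 # assumed spacing between saved conformations
    temperature_kelvin: float = 300.0  # Langevin thermostat temperature

    def snapshot_times(self) -> Iterator[Tuple[int, int, float]]:
        """Yield (conformer index, snapshot index, elapsed time in ps)."""
        for conformer in range(self.n_conformers):
            for snapshot in range(self.snapshots_per_conformer):
                yield conformer, snapshot, (snapshot + 1) * self.ps_between_snapshots

    @property
    def total_snapshots(self) -> int:
        return self.n_conformers * self.snapshots_per_conformer

    @property
    def total_simulation_ps(self) -> float:
        return self.total_snapshots * self.ps_between_snapshots


plan = ThermalizationPlan()
print(plan.total_snapshots, "conformations per molecule")
print(plan.total_simulation_ps, "ps of dynamics per molecule")
```

Under this reading, each molecule contributes 100 conformations, each of which would then receive the brief GFN2-xTB cleanup.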

Quantum chemical levels of theory

Accepted levels of theory

SPICE 1.0: ωB97M-D3BJ/def2-TZVPPD
OpenFF: B3LYP-D3BJ/DZVP
Pretraining: GFN2-xTB
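Dataset entries later in this document refer to these levels by their short labels. A minimal sketch of how such labels might be resolved into method and basis pairs follows; the names `LevelOfTheory`, `ACCEPTED_LEVELS`, `parse_level`, and `resolve` are hypothetical helpers for illustration, not part of any SPICE or QCFractal API.

```python
from typing import Dict, List, NamedTuple, Optional


class LevelOfTheory(NamedTuple):
    method: str
    basis: Optional[str]  # None when the method has no separate basis set


# Labels and specifications copied from the accepted list above.
ACCEPTED_LEVELS: Dict[str, str] = {
    "SPICE 1.0": "ωB97M-D3BJ/def2-TZVPPD",
    "OpenFF": "B3LYP-D3BJ/DZVP",
    "Pretraining": "GFN2-xTB",
}


def parse_level(spec: str) -> LevelOfTheory:
    """Split a 'method/basis' string; semiempirical methods carry no basis."""
    method, sep, basis = spec.partition("/")
    return LevelOfTheory(method.strip(), basis.strip() if sep else None)


def resolve(labels: str) -> List[LevelOfTheory]:
    """Turn a comma-separated label list into parsed levels of theory."""
    return [parse_level(ACCEPTED_LEVELS[label.strip()]) for label in labels.split(",")]


for level in resolve("SPICE 1.0, OpenFF, Pretraining"):
    print(level)
```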

Proposed levels of theory

Very high level for some subset: DLPNO-MP2
Potential improved basis set for some subset: ωB97M-D3BJ/def2-QZVPPD
- Rationale: Error drops from ~3.5 kcal/mol to <1 kcal/mol, but at much greater computational cost
Higher levels of theory needed for open-shell metals: ?
CBS: Could extrapolate to CCSD(T)*/CBS via an approach similar to the one used in the ANI-1ccx paper
OrbNet-like features: Computing these during the calculations costs nothing extra, whereas generating them externally afterwards would be expensive. They do cost storage space, however. HF with localization, as in OrbNet?
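To illustrate what a complete-basis-set (CBS) extrapolation involves, the sketch below implements the common two-point inverse-cubic formula for correlation energies, E_CBS = (X³·E_X − Y³·E_Y) / (X³ − Y³), where X and Y are basis-set cardinal numbers. This is a generic textbook scheme shown under that assumption; it is not necessarily the composite procedure used in the ANI-1ccx paper, and the function name is hypothetical.

```python
def extrapolate_correlation_energy(
    e_small: float, x_small: int, e_large: float, x_large: int
) -> float:
    """Two-point X^-3 extrapolation of the correlation energy to the CBS limit.

    e_small / e_large are correlation energies computed with basis sets of
    cardinal number x_small / x_large (e.g. 3 for triple-zeta, 4 for quadruple-zeta).
    """
    if x_large <= x_small:
        raise ValueError("x_large must exceed x_small")
    xs3 = x_small ** 3
    xl3 = x_large ** 3
    return (xl3 * e_large - xs3 * e_small) / (xl3 - xs3)


# Synthetic check: energies that follow E(X) = E_CBS + A / X^3 exactly.
E_CBS_TRUE = -1.0
A = 0.5
e_tz = E_CBS_TRUE + A / 3 ** 3
e_qz = E_CBS_TRUE + A / 4 ** 3
print(extrapolate_correlation_energy(e_tz, 3, e_qz, 4))
```

The formula applies to the correlation energy only; the Hartree-Fock component is usually taken from the largest basis or extrapolated separately.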

Dataset type

Single-point dataset

Short optimization dataset

An OptimizationDataset using 5 steps of gradient descent, which progresses most of the way to the minimum
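The idea of a fixed, short descent can be shown on a toy potential. The sketch below is a stand-in under stated assumptions: it uses a harmonic well instead of a quantum chemical energy surface, and the function `short_descent`, the step size, and the force constant are all hypothetical choices for illustration.

```python
from typing import Callable, List


def short_descent(
    x0: List[float],
    gradient: Callable[[List[float]], List[float]],
    step_size: float,
    n_steps: int = 5,
) -> List[List[float]]:
    """Run a fixed number of plain gradient-descent steps and keep every geometry."""
    trajectory = [list(x0)]
    x = list(x0)
    for _ in range(n_steps):
        g = gradient(x)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
        trajectory.append(x)
    return trajectory


K = 1.0  # toy force constant for E(x) = 0.5 * K * |x|^2, minimum at the origin


def harmonic_gradient(x: List[float]) -> List[float]:
    return [K * xi for xi in x]


trajectory = short_descent([1.0, -2.0], harmonic_gradient, step_size=0.5)
for step, geometry in enumerate(trajectory):
    print(step, geometry)
```

With this step size each step halves the displacement, so five steps leave about 3% of the starting distance from the minimum. On a real potential energy surface the progress per step depends on the optimizer and the local curvature.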

Datasets

Dataset metadata

We propose to collect for each dataset the following information (to later be pulled into README.md files for each dataset):

Dataset Name:
Location:
Rationale:
Molecular composition:
Structure generation strategy:
QCFractal dataset type:
Level(s) of theory used:
Properties computed:
Priority:
Owner:
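One way to keep these fields consistent, and to generate the per-dataset README.md text later, is a small record type. The sketch below is an assumption about how that could look: `DatasetMetadata`, `to_readme`, and `missing_fields` are hypothetical names, and the example values are taken from the water cluster entry in this document (with the rationale shortened).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DatasetMetadata:
    """Hypothetical record mirroring the metadata fields listed above."""

    name: str
    location: str = ""
    rationale: str = ""
    molecular_composition: str = ""
    structure_generation_strategy: str = ""
    qcfractal_dataset_type: str = ""
    levels_of_theory: List[str] = field(default_factory=list)
    properties: List[str] = field(default_factory=list)
    priority: str = ""
    owner: str = ""

    def _rows(self) -> List[Tuple[str, str]]:
        return [
            ("Dataset Name", self.name),
            ("Location", self.location),
            ("Rationale", self.rationale),
            ("Molecular composition", self.molecular_composition),
            ("Structure generation strategy", self.structure_generation_strategy),
            ("QCFractal dataset type", self.qcfractal_dataset_type),
            ("Level(s) of theory used", ", ".join(self.levels_of_theory)),
            ("Properties computed", ", ".join(self.properties)),
            ("Priority", self.priority),
            ("Owner", self.owner),
        ]

    def to_readme(self) -> str:
        """Render one 'Label: value' line per field, in the order listed above."""
        return "\n".join(f"{label}: {value}".rstrip() for label, value in self._rows()) + "\n"

    def missing_fields(self) -> List[str]:
        """Labels that still need to be filled in before the dataset is submitted."""
        return [label for label, value in self._rows() if not value]


water = DatasetMetadata(
    name="300K 1atm bulk water clusters",
    location="https://github.com/openmm/spice-dataset/tree/main/water",
    rationale="To capture interactions representative of bulk water",  # shortened
    molecular_composition="Water molecules only",
    structure_generation_strategy="Clusters of 30 water molecules from bulk simulations",
    qcfractal_dataset_type="SinglePointDataset",
    properties=["energy", "gradient"],
    priority="High",
    owner="Peter Eastman (@peastman)",
)
print(water.to_readme())
print("Still missing:", water.missing_fields())
```

A `missing_fields` check like this makes it easy to see which proposed datasets still lack an owner, priority, or level of theory.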

Accepted datasets

Water clusters

Dataset Name: 300K 1atm bulk water clusters
Location: https://github.com/openmm/spice-dataset/tree/main/water
Rationale: To capture interactions representative of bulk water, we need samples of compact water clusters carved out of snapshots of bulk water at ambient temperature and pressure.
Molecular composition: Water molecules only
Structure generation strategy: This collection contains 1000 conformations for a cluster of 30 water molecules extracted from bulk water simulations using the amoeba2018.xml force field at 300 K and 1 atm.
QCFractal dataset type: SinglePointDataset
Level(s) of theory to be used:
Properties to be computed: energy, gradient
Priority: High
Owner: Peter Eastman (@peastman)
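The entry above does not say how clusters are carved from the bulk snapshots. One simple possibility, shown here purely as an assumption, is to keep the N molecules whose oxygen atoms lie closest to a chosen central molecule. The function `carve_cluster` is hypothetical, and the sketch ignores periodic boundary conditions for brevity.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def carve_cluster(oxygens: Sequence[Vec3], center_index: int, n_molecules: int) -> List[int]:
    """Return indices of the n_molecules waters nearest to the central water.

    Distances are measured between oxygen positions. Periodic images are
    ignored, which a real implementation would need to handle.
    """
    center = oxygens[center_index]
    order = sorted(range(len(oxygens)), key=lambda i: math.dist(oxygens[i], center))
    return order[:n_molecules]


# Toy input: five oxygens on a line, 1 unit apart.
toy_oxygens = [(float(x), 0.0, 0.0) for x in range(5)]
print(carve_cluster(toy_oxygens, center_index=2, n_molecules=3))
```

For the dataset described above, `n_molecules` would be 30 and the oxygen coordinates would come from a bulk water snapshot.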

Proposed Datasets

Solvated small molecules

Amino acid : ligand interactions from the PDB

PDB Chemical Components Dictionary (simple)

Dataset Name: PDB Chemical Components Dictionary (CCD) simple subset
Location:
Rationale: To ensure that much of the Protein Data Bank (PDB) can be simulated, this dataset covers a simple subset of the CCD limited to common elements, with no metals
Molecular composition: PDB CCD subset with common elements and no metals
Structure generation strategy: Thermalization at 300 K
QCFractal dataset type: SinglePointDataset
Level(s) of theory to be used: SPICE 1.0, OpenFF, Pretraining
Properties to be computed: energy, gradient
Priority:
Owner:

PDB Chemical Components Dictionary (metals and heavy elements)

Dataset Name: PDB Chemical Components Dictionary (CCD) containing metals and heavy elements
Location:
Rationale: To ensure that much of the Protein Data Bank (PDB) can be simulated, this dataset covers the subset of the CCD that contains metals and heavy elements
Molecular composition: PDB CCD subset with metals and heavier elements
Structure generation strategy: Thermalization at 300 K
QCFractal dataset type: SinglePointDataset
Level(s) of theory to be used: (Likely needs higher levels of theory)
Properties to be computed: energy, gradient
Priority:
Owner:

Enamine REALSpace building blocks

Dataset Name: Enamine REALSpace building blocks
Location:
Rationale: To describe the Enamine REALSpace virtual synthetic library, this dataset includes a diverse subset of Enamine building blocks
Molecular composition: Enamine building blocks
Structure generation strategy: Thermalization at 300 K
QCFractal dataset type: SinglePointDataset
Level(s) of theory to be used: SPICE 1.0, OpenFF, Pretraining
Properties to be computed: energy, gradient
Priority:
Owner:

Enamine REALSpace couplings

Dataset Name: Enamine REALSpace coupling chemistries
Location:
Rationale: To describe the Enamine REALSpace virtual synthetic library, this dataset includes a diverse subset of coupling chemistries used to link Enamine building blocks within Enamine REALSpace
Molecular composition: Enamine REALSpace coupling chemistries
Structure generation strategy: Thermalization at 300 K
QCFractal dataset type: SinglePointDataset
Level(s) of theory to be used: SPICE 1.0, OpenFF, Pretraining
Properties to be computed: energy, gradient
Priority:
Owner:

Organic ionic interactions

Dataset Name: Organic ion interactions
Location:
Rationale: To capture interactions between organic ions, both in direct contact and separated by solvent
Molecular composition: Fragments from pairs of interacting residues from the PDB where at least one fragment is an organic ion, interacting directly or via solvent bridges.
Structure generation strategy: Geometries extracted directly from the PDB
QCFractal dataset type:
Level(s) of theory to be used: SPICE 1.0, OpenFF
Properties to be computed: energy, gradient
Priority:
Owner:

Metal-organic interactions from the PDB

Dataset Name: Metal-organic interactions from the PDB
Location:
Rationale: Interactions of metal centers with chelating sidechains and small molecules are important for modeling metalloenzymes.
Molecular composition: This set includes metal centers, coordinating protein sidechain fragments, and coordinated small molecules from the PDB.
Structure generation strategy: Geometries extracted directly from the PDB
QCFractal dataset type:
Level(s) of theory to be used: (Likely needs to be a better level of theory than SPICE 1.0)
Properties to be computed: energy, gradient
Priority:
Owner:
