Ligand Expo molecules #81

peastman · 2023-09-11T20:05:20Z

This will eventually be a collection of molecules from Ligand Expo. I've written a draft of the script to identify molecules and variants we want to process. The current implementation of processLigand() is just a placeholder, showing a minimal implementation with RDKit. It takes a SMILES string and returns a new set of SMILES strings for variants. @jchodera will replace it with a better implementation.

I wrapped the calculation in a ThreadPoolExecutor on the assumption you would want to parallelize it. You can change it to a ProcessPoolExecutor if needed.
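For illustration, here is a minimal sketch of that executor arrangement using only the standard library. It is not the script from this PR: `process_ligand` is a hypothetical stand-in that simply echoes its input (the actual placeholder uses RDKit), and `run_all` is an illustrative helper name.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def process_ligand(smiles: str) -> set[str]:
    """Hypothetical stand-in: return the variants of one molecule as SMILES.

    A real implementation would enumerate variants; this one echoes its input.
    """
    return {smiles}


def run_all(
    smiles_list: list[str],
    executor_cls: type[Executor] = ThreadPoolExecutor,
) -> dict[str, set[str]]:
    """Apply process_ligand to every input using the given executor class."""
    with executor_cls(max_workers=4) as executor:
        # executor.map yields results in input order, so zip pairs them up.
        return dict(zip(smiles_list, executor.map(process_ligand, smiles_list)))


if __name__ == "__main__":
    print(run_all(["CCO", "c1ccccc1", "CC(=O)O"]))
```

Because the executor class is a parameter, switching to processes should only require passing `ProcessPoolExecutor` (from the same `concurrent.futures` module) as `executor_cls`, provided the worker function and its arguments are picklable.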

peastman · 2023-10-28T00:22:41Z

We need to figure out how this is going to work. Based on the discussion in #67 it seems like we're leaning toward something like this:

1. Select every Ligand Expo molecule up to some fairly small maximum size.
2. Identify all low-energy tautomer/protonation states for each one (or perhaps just protonation states?).
3. Somehow generate conformations for them.

@jchodera what do you suggest for generating the conformations? You seemed to have specific ideas about how to do it. As long as we only include very small molecules, we can reasonably include up to a few hundred thousand total conformations if necessary.

peastman · 2023-11-11T22:56:00Z

@jchodera any ideas about this?

jchodera · 2023-11-12T18:26:56Z

Apologies for the delay---the OpenMM/OpenFF renewal proposals ate up a bunch of time.

Ideally, the pipeline would use the following steps:

1. Apply filtering rules (e.g. elements, min/max number of heavy atoms).
2. Sort by popularity so that the most-used chemical components appear first in the list.
3. Expand protonation/tautomeric states with Epik, keeping states above a minimum solution population (e.g. exp(-6), to get states up to a ~6 kT penalty).
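As a rough sketch of the three steps above (not an actual implementation), the selection logic might look like the following. The `Component` record, the allowed element set, the heavy-atom bounds, and the use of a usage count as the popularity measure are all assumptions for illustration. The populations passed to `keep_states` are assumed to come from an external tool such as Epik, which is not called here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical record describing one chemical component."""
    comp_id: str
    elements: frozenset[str]
    heavy_atoms: int
    usage_count: int  # assumed popularity measure


# Assumed filter settings; the thread leaves the actual rules open.
ALLOWED_ELEMENTS = frozenset({"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"})
MIN_HEAVY, MAX_HEAVY = 3, 50

# A population of exp(-6) corresponds to a penalty of 6 kT.
MAX_PENALTY_KT = 6.0
MIN_POPULATION = math.exp(-MAX_PENALTY_KT)


def select_components(components: list[Component]) -> list[Component]:
    """Steps 1 and 2: filter by elements and size, then sort most-used first."""
    kept = [
        c for c in components
        if c.elements <= ALLOWED_ELEMENTS and MIN_HEAVY <= c.heavy_atoms <= MAX_HEAVY
    ]
    return sorted(kept, key=lambda c: c.usage_count, reverse=True)


def keep_states(populations: dict[str, float]) -> list[str]:
    """Step 3: keep the states whose population is at least exp(-6)."""
    return [state for state, p in populations.items() if p >= MIN_POPULATION]
```

Since exp(-6) is about 0.0025, a state with a population of 0.05 would be kept and one with a population of 0.001 would be dropped.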

In terms of how we specifically generate conformations, we've talked about one or both of two approaches:
A. After expanding protonation/tautomeric states, run MD at some temperature (300 K? 400 K?) with a surrogate potential, such as GFN2-xTB, in vacuum
B. Before expanding protonation/tautomeric states, we enumerate <10 conformers with the OpenFF Molecule.generate_conformers (which can either use RDKit or OpenEye toolkits), vary protonation/tautomer states of each conformation, and then subject these directly to max ~3 steps of optimization in an OptimizationDataset

Questions

For (1), what filtering rules should we apply? Should we really use total number of atoms in [3,100] as the filter criterion, or filter on heavy atoms or molecular weight? Should we expand to more main group elements, which our level of theory seems to handle well, or would we need pseudopotentials to make these efficient?
For (3), I think we established that 6 kT was fine.
I think you prefer starting with (A), but it would be great to include a (B) dataset as well, even if just for the OpenFF level of theory.
In terms of naming of each molecule, is there a preference? Can we use something like {PDB ID}_{conformer index}_{protonation/tautomer state index} (or some permutation)? Or should we be using IUPAC names? Or even SMILES?
Do we want to consider a separate dataset that includes transition metal elements?
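To make the naming question concrete, here is a small sketch of the `{PDB ID}_{conformer index}_{protonation/tautomer state index}` option. The function names, the field order, and the assumption that the component ID is purely alphanumeric are illustrative choices, not something settled in this thread.

```python
import re

# Assumed layout: <component ID>_<conformer index>_<state index>
NAME_PATTERN = re.compile(r"^(?P<comp>[A-Za-z0-9]+)_(?P<conf>\d+)_(?P<state>\d+)$")


def make_name(component_id: str, conformer_index: int, state_index: int) -> str:
    """Build a name from a component ID and the two indices."""
    return f"{component_id}_{conformer_index}_{state_index}"


def parse_name(name: str) -> tuple[str, int, int]:
    """Split a name back into (component ID, conformer index, state index)."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid molecule name: {name!r}")
    return match["comp"], int(match["conf"]), int(match["state"])
```

One possible argument for an index-based scheme over SMILES is that SMILES strings can contain characters such as `/` and `\`, which may be awkward in file paths or hierarchical keys, whereas these names round-trip through `parse_name` unambiguously.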

I should be able to tackle this early this week.

Tagging @wiederm for additional comment.

peastman · 2023-11-12T20:03:50Z

There are two distinct datasets we've talked about creating based on Ligand Expo. They're for different purposes.

One possible dataset would be to look at the effect of protonation. It would include pairs of molecules that differ only in the presence of a single hydrogen. They would be in identical conformations, so the only difference would be the extra hydrogen. This would be limited to very small molecules, maybe 10 atoms or so. Applying it to large molecules would be wasted computation. If a pair of 100 atom molecules are identical except for a single hydrogen and are in identical conformations, most atoms see nearly identical environments and have nearly identical forces.

The other possible dataset would be to increase our sampling of chemical space. It would include all molecules up to a fairly large size limit, maybe 100 atoms, possibly augmented with tautomers and protonation variants. That would be a lot of molecules, some of them quite large, so we would need to limit it to a very small number of conformations for each molecule, possibly only a single conformation.

An alternative we've discussed for sampling more chemical space is to use Enamine molecules. We would do one or the other of those two datasets, not both. They would both have the same goal, and they would both be very expensive.

Also note that the dataset from #72 already includes conformations for all Ligand Expo molecules with up to 36 atoms.

First draft of script to find variants

cb3686d
