Ligand Expo molecules #81
We need to figure out how this is going to work. Based on the discussion in #67 it seems like we're leaning toward something like this:
@jchodera what do you suggest for generating the conformations? You seemed to have specific ideas about how to do it. As long as we only include very small molecules, we can reasonably include up to a few hundred thousand total conformations if necessary.
@jchodera any ideas about this?
Apologies for the delay. The OpenMM/OpenFF renewal proposals ate up a bunch of time. Ideally, the pipeline would use the following steps:
In terms of how we specifically generate conformations, we've talked about one or both of two approaches: Questions
I should be able to tackle this early this week. Tagging @wiederm for additional comment.
There are two distinct datasets we've talked about creating based on Ligand Expo. They're for different purposes.
One possible dataset would look at the effect of protonation. It would include pairs of molecules that differ only in the presence of a single hydrogen. They would be in identical conformations, so the only difference would be the extra hydrogen. This would be limited to very small molecules, maybe 10 atoms or so. Applying it to large molecules would be wasted computation: if two 100-atom molecules are identical except for a single hydrogen and are in identical conformations, most atoms see nearly identical environments and have nearly identical forces.
The other possible dataset would increase our sampling of chemical space. It would include all molecules up to a fairly large size limit, maybe 100 atoms, possibly augmented with tautomers and protonation variants. That would be a lot of molecules, some of them quite large, so we would need to limit it to a very small number of conformations for each molecule, possibly only a single conformation.
An alternative we've discussed for sampling more chemical space is to use Enamine molecules. We would do one or the other of those two datasets, not both. They would both have the same goal, and they would both be very expensive.
Also note that the dataset from #72 already includes conformations for all Ligand Expo molecules with up to 36 atoms.
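To make the size limits in the comment above concrete, here is a rough sketch of how a candidate molecule could be sorted into the proposed datasets by atom count. The thresholds are the tentative numbers quoted in the discussion ("maybe 10 atoms or so", "maybe 100 atoms", and the 36-atom coverage of #72). The `Candidate` class, the `classify` function, the tag names, and the choice to count hydrogens in `num_atoms` are all assumptions made for illustration, not part of the PR.

```python
from dataclasses import dataclass
from typing import Set

# Tentative limits quoted in the discussion; none of them are settled.
PROTONATION_MAX_ATOMS = 10      # "maybe 10 atoms or so"
CHEMICAL_SPACE_MAX_ATOMS = 100  # "maybe 100 atoms"
COVERED_BY_72_MAX_ATOMS = 36    # the dataset from #72 goes up to 36 atoms


@dataclass
class Candidate:
    """Hypothetical record for one Ligand Expo molecule."""
    name: str
    num_atoms: int  # assumed to be the total atom count, hydrogens included


def classify(candidate: Candidate) -> Set[str]:
    """Return the (hypothetical) dataset tags a molecule would qualify for."""
    tags = set()
    if candidate.num_atoms <= PROTONATION_MAX_ATOMS:
        tags.add("protonation_pairs")
    if candidate.num_atoms <= CHEMICAL_SPACE_MAX_ATOMS:
        tags.add("chemical_space")
    if candidate.num_atoms <= COVERED_BY_72_MAX_ATOMS:
        tags.add("covered_by_72")
    return tags
```

A molecule with 50 atoms, for example, would only qualify for the chemical-space dataset under these assumed limits, and one with more than 100 atoms would be skipped entirely.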
This will eventually be a collection of molecules from Ligand Expo. I've written a draft of the script to identify molecules and variants we want to process. The current implementation of processLigand() is just a placeholder, showing a minimal implementation with RDKit. It takes a SMILES string and returns a new set of SMILES strings for variants. @jchodera will replace it with a better implementation. I wrapped the calculation in a ThreadPoolExecutor on the assumption you would want to parallelize it. You can change it to a ProcessPoolExecutor if needed.
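For readers without the script at hand, the executor pattern described above might look roughly like the sketch below. It is not the code from this PR: `process_ligand_stub` is a hypothetical stand-in that simply returns its input (the real placeholder uses RDKit), and `process_all`, `executor_cls`, and `max_workers` are names invented here. Only the standard library is used so the sketch runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Set


def process_ligand_stub(smiles: str) -> Set[str]:
    """Hypothetical stand-in for processLigand().

    Takes a SMILES string and returns a set of SMILES strings for variants.
    This stub returns only the input; the real function would enumerate
    variants with a cheminformatics toolkit.
    """
    return {smiles}


def process_all(
    smiles_list: Iterable[str],
    worker: Callable[[str], Set[str]] = process_ligand_stub,
    executor_cls=ThreadPoolExecutor,
    max_workers: int = 4,
) -> Dict[str, Set[str]]:
    """Run `worker` on every SMILES string in parallel.

    Returns a dict mapping each input SMILES to its set of variants.
    Duplicate inputs collapse into a single dict entry.
    """
    inputs = list(smiles_list)
    with executor_cls(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs.
        results = list(executor.map(worker, inputs))
    return dict(zip(inputs, results))
```

Passing `executor_cls=ProcessPoolExecutor` switches to processes, which may help if the per-molecule work is CPU-bound and holds the GIL. In that case the worker has to be a picklable top-level function, and on platforms that spawn processes the call should sit under an `if __name__ == "__main__":` guard.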