[WIP] Scripts to create pretraining dataset #65
base: main
Get one conformation per molecule, and prefer more molecules while keeping the budget constant. |
Computational budget isn't a problem. This method is super cheap. We can include more molecules and also lots of conformations per molecule. Based on your experience, how large should it be, and how should we select the conformations? |
Generate conformers however you wish (e.g. with RDKit), but just one per molecule, and use more molecules.
|
Why just one? And again, how large should the dataset be? |
RDKit is not very good at generating more than one or two conformers. Given a
fixed budget, it is better to have more molecules than more conformations.
Realistically, training on more than 10M points starts to be problematic.
|
We don't rely on RDKit to generate the conformations, just starting points for MD simulations. |
If you're going to run MD to generate conformations, we probably do want multiple overdispersed starting points in case crossing torsional barriers is difficult. If the RDKit conformers end up being too similar, this shouldn't be much of a problem; it's like running more MD, especially if you allow some "burn-in" equilibration before collecting samples from each conformation. I fear the only way to optimize the choice of N conformers × M snapshots per conformer is to train some models and assess generalization. There's no real a priori way to know what is optimal here, though there are probably reasonable lower bounds (N >= 3, M > 10?). |
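To make the "burn-in" idea concrete, here is a minimal sketch of how samples might be collected from N trajectories, one per starting conformer. The function names (`collect_snapshots`, `pool_samples`) are hypothetical and are not taken from the PR's scripts. A trajectory is represented as a plain list of frames, and picking evenly spaced frames after the burn-in is an assumption made for illustration only.

```python
def collect_snapshots(trajectory, burn_in, n_snapshots):
    """Discard the first `burn_in` frames, then return `n_snapshots`
    evenly spaced frames from the remainder (hypothetical helper)."""
    production = trajectory[burn_in:]
    if n_snapshots <= 0 or n_snapshots > len(production):
        raise ValueError("not enough frames left after burn-in")
    # Integer stride so the picks are spread across the production segment.
    stride = len(production) // n_snapshots
    return production[::stride][:n_snapshots]


def pool_samples(trajectories, burn_in, n_snapshots):
    """Pool M = n_snapshots frames from each of the N trajectories,
    giving N * M samples for one molecule."""
    return [
        frame
        for trajectory in trajectories
        for frame in collect_snapshots(trajectory, burn_in, n_snapshots)
    ]


if __name__ == "__main__":
    # Stand-in data: 3 trajectories of 100 integer "frames" each.
    trajectories = [list(range(100)) for _ in range(3)]
    samples = pool_samples(trajectories, burn_in=20, n_snapshots=11)
    print(len(samples))  # N * M = 3 * 11
```

With this layout, comparing different (N, M) choices only means changing the number of trajectories and `n_snapshots`. As noted above, deciding which choice generalizes best would still require training models.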
The current code asks RDKit to generate 10 conformers. Starting from each one, it runs MD to generate 10 conformations at each of four temperatures, for a total of 400 conformations per molecule. @giadefa what is the problem with training on more than 10 million points? ANI-1 has 20 million, and people train on it all the time. |
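As a quick check on the numbers in this thread, the sketch below computes the per-molecule sample count for the sampling scheme described above. It also shows how many molecules would fit under a given cap on total training points. The 10 million and 20 million figures are the ones quoted in the discussion, not hard limits, and the helper name `molecules_within_budget` is hypothetical.

```python
# Sampling scheme described in the comment above.
N_CONFORMERS = 10               # RDKit starting conformers per molecule
SNAPSHOTS_PER_TEMPERATURE = 10  # MD conformations per conformer per temperature
N_TEMPERATURES = 4              # number of MD temperatures

CONFORMATIONS_PER_MOLECULE = (
    N_CONFORMERS * SNAPSHOTS_PER_TEMPERATURE * N_TEMPERATURES
)


def molecules_within_budget(total_points, per_molecule):
    """Number of whole molecules that fit under a cap on training points."""
    if per_molecule <= 0:
        raise ValueError("per_molecule must be positive")
    return total_points // per_molecule


if __name__ == "__main__":
    print(CONFORMATIONS_PER_MOLECULE)  # 400
    # Cap of 10 million points, 400 conformations per molecule.
    print(molecules_within_budget(10_000_000, CONFORMATIONS_PER_MOLECULE))  # 25000
    # Same cap with a single conformation per molecule.
    print(molecules_within_budget(10_000_000, 1))  # 10000000
    # ANI-1-sized set (20 million points) at 400 per molecule.
    print(molecules_within_budget(20_000_000, CONFORMATIONS_PER_MOLECULE))  # 50000
```

Under a fixed cap, the trade-off is direct: 400 conformations per molecule allows 400 times fewer distinct molecules than one conformation per molecule.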
This PR will have the scripts to generate the pretraining dataset discussed in #64. So far I've implemented the dipeptides subset. Let me know if this looks good. @giadefa I'd especially appreciate your feedback on what conformations to include, since you have experience on pretraining with large amounts of semi-empirical data.
The script only takes a few hours to run on my laptop. It generates about 310 MB of output data. I estimate the complete pretraining dataset will be around 10 GB, assuming we include the same molecules as the standard dataset and the same level of sampling for the other subsets.