Pretraining dataset #64
I think there could be value in creating a separate dataset for pretraining. It would cover the same chemical space as the standard SPICE dataset, but have many more conformations and be computed at a much lower level of theory. The idea would be to pretrain your model on the large dataset, then fine-tune it on the smaller, higher quality one.

This raises several questions. For example, the current dipeptides and PubChem subsets include 50 conformations for each molecule: 25 high energy conformations sampled at 500 K and 25 low energy conformations that are partially energy minimized. For the pretraining dataset we might instead include 100 conformations at each of four temperatures: 100 K, 300 K, 500 K, and 1000 K. In place of DES370K we could use DES5M.
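For concreteness, here is a minimal sketch of what that multi-temperature sampling could look like. It assumes OpenMM (nothing in this thread fixes the toolkit), `build_simulation` is a hypothetical helper, and the snapshot count and step interval are purely illustrative:

```python
import openmm.unit as unit

# Illustrative sampling plan: 100 snapshots per molecule at each temperature.
TEMPERATURES = (100, 300, 500, 1000)  # kelvin
SNAPSHOTS_PER_TEMPERATURE = 100
STEPS_BETWEEN_SNAPSHOTS = 10_000      # decorrelation interval (assumption)

def sample_conformations(molecule):
    """Collect conformations of one molecule across several temperatures."""
    conformations = []
    for temperature in TEMPERATURES:
        # Hypothetical helper: returns an openmm.app.Simulation with a
        # Langevin integrator thermostatted at the given temperature.
        simulation = build_simulation(molecule, temperature * unit.kelvin)
        simulation.context.setVelocitiesToTemperature(temperature * unit.kelvin)
        for _ in range(SNAPSHOTS_PER_TEMPERATURE):
            simulation.step(STEPS_BETWEEN_SNAPSHOTS)
            state = simulation.context.getState(getPositions=True)
            conformations.append(state.getPositions(asNumpy=True))
    return conformations
```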
This is a good idea, and it will enable experiments to assess the utility of different pretraining approaches.
A 10x larger dataset would certainly enable useful assessments of the impact of pretraining on data efficiency.
QCEngine supports a number of semiempirical methods, though the host programs it drives would also have to be deployed in the QCFractal compute environment. GFN2-xTB sounds like a good starting point.
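As a concrete reference, a GFN2-xTB single point driven through QCEngine might look like the sketch below. It assumes the `xtb` Python package is installed alongside `qcengine` in the worker environment; the molecule and driver choice are just for illustration:

```python
import qcelemental as qcel
import qcengine as qcng

# Water, coordinates in bohr (QCElemental's default unit).
mol = qcel.models.Molecule(
    symbols=["O", "H", "H"],
    geometry=[0.0, 0.0, -0.124, 0.0, 1.431, 0.983, 0.0, -1.431, 0.983],
)

inp = qcel.models.AtomicInput(
    molecule=mol,
    driver="gradient",            # energy + gradient in one call
    model={"method": "GFN2-xTB"},
)

result = qcng.compute(inp, "xtb")        # dispatches to the xtb harness
print(result.properties.return_energy)   # hartree
print(result.return_result)              # gradient, hartree/bohr
```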
If the goal is to do experiments on pretraining to improve data efficiency, it would be useful to draw molecules from the same distribution the SPICE molecules were drawn from. We could increase both dimensions: the number of molecules and the number of conformers per molecule.
That seems reasonable. Does GFN2-xTB even support other properties?
It depends on the goal.
- If the goal is to assess the impact of pretraining on data efficiency, it would be useful to generate the data with the same process used for SPICE.
- If the goal is to assess other data-generation methods for utility and data efficiency, and to evaluate on different kinds of ensembles, it would be worth generating a number of datasets at different temperatures, as you suggest, which could be used either separately or together.
- If the goal is to scout which other datasets might be more useful (e.g. experimenting with training models at the same level of theory on different data subsets), it may also be of interest to generate data from different chemical spaces, such as those we previously identified as high value: for example, the PDB Chemical Component Dictionary or the Enamine Building Blocks (freely downloadable here).
There's no need to use QCFractal for this. Running the xtb calculations takes less time than generating the conformations in the first place, so it's simplest to do everything at once in a single script.

The goal is not to do any of the things you listed; it is to create a dataset that can be used in practice for pretraining models. The more data you train on, the better your model ends up being. Ideally you should include some data for very rare, very high energy conformations, which reduces the risk of the model doing something strange as soon as it gets outside the range of typical conformations. Generating all that data with a high quality method would be very expensive, so instead you pretrain on data generated with a cheap method, then fine-tune on a smaller amount of high quality data.
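A minimal sketch of that single-script idea, assuming RDKit for conformer generation and the `xtb-python` bindings for the GFN2-xTB single points (the function name, conformer count, and seed are illustrative):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from xtb.interface import Calculator, Param

ANGSTROM_TO_BOHR = 1.8897259886

def conformer_energies(smiles, n_confs=100, seed=42):
    """Generate conformers with RDKit, then score each with GFN2-xTB."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.randomSeed = seed
    AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, params=params)

    numbers = np.array([a.GetAtomicNum() for a in mol.GetAtoms()])
    results = []
    for conf in mol.GetConformers():
        positions = conf.GetPositions() * ANGSTROM_TO_BOHR  # xtb wants bohr
        calc = Calculator(Param.GFN2xTB, numbers, positions)
        res = calc.singlepoint()
        results.append((res.get_energy(), res.get_gradient()))
    return results
```

Since each molecule is independent, a loop like this parallelizes trivially across processes, which may be part of why QCFractal adds little here.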