Machine learning models for the FORMED database and downstream tasks.

lcmd-epfl/FORMED_ML

FORMED_ML

Machine learning models for the FORMED database and downstream tasks, and a cross-coupling tool.

All the raw data associated with this project can be found in the corresponding Materials Cloud record, which also includes an interactive visualization. Notably, all labels and the xyz files containing the 3D molecular structures are available there.

Installation

We provide a conda environment file, environment.yml, that installs all requirements into a conda environment called FORMED. The FORMED environment can be used to run all scripts and notebooks provided in this repository. This approach has been tested on several recent releases of Ubuntu (18 to 22) with Python versions 3.7 to 3.9, and the process should take a few minutes. We also provide a requirements.txt file for installation into a virtual environment.

Use the environment file by running conda env create -f environment.yml and activate the environment with conda activate FORMED.

Usage note

We do not provide the SLATM representations of the molecules, which are required to run the ML models, because the resulting arrays are very large. Instead, we provide scripts (generate_slatm.py) to produce them from the xyz files containing the 3D structures.

To re-train the models, we recommend that you re-generate the representations and labels from the raw data in the Materials Cloud record to minimize the chance of mismatching molecules and properties. To run inference, first generate the representations from the xyz files of the molecules to predict.
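As a minimal sketch of the first step, the snippet below reads a directory of xyz files in a fixed, sorted order and keeps the file names alongside the parsed structures, so the order can be checked later. It assumes the standard xyz layout (atom count, comment line, then one "element x y z" line per atom); the helper names read_xyz and load_xyz_directory are hypothetical, and this is not the code of generate_slatm.py, which should be used to produce the actual SLATM arrays.

```python
from pathlib import Path

import numpy as np


def read_xyz(path):
    """Parse one standard xyz file into (element symbols, coordinate array)."""
    lines = Path(path).read_text().splitlines()
    n_atoms = int(lines[0].split()[0])  # first line: number of atoms
    symbols, coords = [], []
    # line 2 is a free-form comment; atom records start on line 3
    for line in lines[2:2 + n_atoms]:
        parts = line.split()
        symbols.append(parts[0])
        coords.append([float(value) for value in parts[1:4]])
    return symbols, np.array(coords)


def load_xyz_directory(xyz_dir):
    """Read every *.xyz file in sorted order and return (names, molecules)."""
    paths = sorted(Path(xyz_dir).glob("*.xyz"))  # deterministic ordering
    names = np.array([p.stem for p in paths])    # one name per molecule
    molecules = [read_xyz(p) for p in paths]
    return names, molecules
```

Saving the returned names next to any array derived from the molecules makes it possible to confirm afterwards that representations and labels refer to the same molecules in the same order.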

WARNING!

The representations are not provided, so users must generate them with the provided scripts (generate_slatm.py). This carries a risk of shuffling the indices with respect to the given data arrays. To avoid this, use the provided names arrays to verify the ordering, or regenerate the label numpy arrays from the raw data in the Materials Cloud record. Be careful!
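One way to guard against shuffled indices is to reorder one array so that its names match the other, and to fail loudly if a name is missing or duplicated. The following is only a sketch under stated assumptions: align_to_names is a hypothetical helper, and it assumes you have one array of molecule names for the representations and one for the labels; the actual file names of the names arrays in this repository are not reproduced here.

```python
import numpy as np


def align_to_names(reference_names, names, values):
    """Reorder `values` (rows labelled by `names`) to follow `reference_names`."""
    reference_names = np.asarray(reference_names).tolist()
    names = np.asarray(names).tolist()
    values = np.asarray(values)
    if len(names) != len(values):
        raise ValueError("names and values must have the same length")
    if len(set(names)) != len(names):
        raise ValueError("names must be unique")
    # map each name to its row in `values`
    position = {name: i for i, name in enumerate(names)}
    missing = [name for name in reference_names if name not in position]
    if missing:
        raise KeyError(f"{len(missing)} names not found, e.g. {missing[0]}")
    order = np.array([position[name] for name in reference_names], dtype=int)
    return values[order]
```

For example, if repr_names is the order in which the representations were generated and label_names is the order of a label array, then align_to_names(repr_names, label_names, labels) returns the labels in the same order as the representations.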

Content

  1. crosscoupler contains the source code and an example of the cross-coupling tool, which finds suitable unique sp2 carbons in molecules and generates coupling products. The code is given as a Jupyter notebook. To run the notebook, register the FORMED conda environment (vide supra) as a kernel by running python -m ipykernel install --user --name=FORMED. After that, you should be able to run the notebook normally by selecting the FORMED environment as the kernel. Example inputs are provided and pre-filled; the expected output is detailed in the notebook, and the runtime should be almost instantaneous.

  2. cv contains 10-fold cross-validation scripts for the XGBoost ML models, as well as the outputs of the scripts. To run them, execute generate_slatm.py, pointing it to the xyz files (vide infra), to generate the repr.npy file containing the SLATM representations.

  3. data contains the raw data as numpy arrays, as extracted from the TD-DFT computations. It also contains the script that generates the SLATM representations from the xyz files available in the Materials Cloud record and saves them as repr.npy. The same data is also available in the record.

  4. models contains the trained XGBoost models and learning curves. To re-run, execute generate_slatm.py, pointing it to the xyz files (vide supra), to generate the repr.npy file containing the SLATM representations.

  5. predict contains the scripts for inference using the trained models. The SLATM representations of the dimer data can be generated with the given script from the xyz files available in the Materials Cloud record. The output of the predictions is also given. To run, execute generate_slatm.py, pointing it to the xyz files of the molecules to predict, to generate the repr.npy file containing the SLATM representations.

  6. substructures contains the exact definition of the SMARTS keys used for substructure search.

  7. no_dimers contains the models and cross-validation scripts to test the effect of including (or not) the 2506 selected dimers in the training set using pretrained models.

  8. data_no_dimers contains analogous content to data without the 2506 selected dimers.

  9. data_dimers contains analogous content to data without the FORMED dataset.

  10. select_dimers contains the source code used to select a diverse subset of 2506 dimers. The code is given as a Jupyter notebook. To run the notebook, register the FORMED conda environment (vide supra) as a kernel by running python -m ipykernel install --user --name=FORMED. After that, you should be able to run the notebook normally by selecting the FORMED environment as the kernel. Full functionality requires the coupling site information obtained using the crosscoupler.
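For readers unfamiliar with the 10-fold cross-validation mentioned for the cv directory, the procedure can be sketched as follows. This is an illustration only, not the repository's scripts: k_fold_indices and cross_validate are hypothetical helpers, the mean absolute error is an assumed metric, and the model is passed in as any object with fit and predict methods.

```python
import numpy as np


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle the sample indices and split them into n_folds disjoint folds."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(n_samples)
    return np.array_split(permutation, n_folds)


def cross_validate(make_model, X, y, n_folds=10, seed=0):
    """Return the mean absolute error on each held-out fold."""
    X, y = np.asarray(X), np.asarray(y)
    folds = k_fold_indices(len(X), n_folds, seed)
    maes = []
    for i, test_idx in enumerate(folds):
        # train on every fold except the held-out one
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = make_model()  # fresh, untrained model for each fold
        model.fit(X[train_idx], y[train_idx])
        prediction = model.predict(X[test_idx])
        maes.append(float(np.mean(np.abs(prediction - y[test_idx]))))
    return np.array(maes)
```

With xgboost installed, make_model could be, for instance, lambda: xgboost.XGBRegressor(), with X loaded from repr.npy and y from the matching label array; the hyperparameters used in the repository's own scripts are not reproduced here.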
