Machine learning models for the FORMED database and downstream tasks, together with a cross-coupling tool.
All raw data associated with this project can be found in the corresponding Materials Cloud record, which also includes an interactive visualization. Notably, all labels and the xyz files containing the 3D molecular structures are available there.
We provide a conda environment file, `environment.yml`, to install all requirements into a conda environment called `FORMED`. The `FORMED` environment can be used to run all scripts and notebooks provided in this repository. This approach has been tested on several recent releases of Ubuntu (18-22) with Python versions 3.7-3.9, and the process should take a few minutes. We also provide a `requirements.txt` file for installation in a virtual environment.
Create the environment by running `conda env create -f environment.yml` and activate it with `conda activate FORMED`.
We do not provide the SLATM representations of the molecules, which are required to run the ML models, because the resulting arrays are very large. Instead, we provide scripts (`generate_slatm.py`) to produce them from the xyz files containing the 3D structures.
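As a rough illustration of the input side of this step, the sketch below reads a directory of xyz files in a fixed, sorted order and keeps the file stems as molecule names. It is not the code in `generate_slatm.py`; the function names `read_xyz` and `load_xyz_directory` are hypothetical, and the only assumption about the data is the standard xyz layout (atom count, comment line, then one `element x y z` line per atom).

```python
from pathlib import Path


def read_xyz(path):
    """Parse one xyz file into (elements, coordinates). Hypothetical helper."""
    lines = Path(path).read_text().splitlines()
    n_atoms = int(lines[0].split()[0])
    elements, coords = [], []
    # Line 0 holds the atom count, line 1 is a free-form comment, atoms follow.
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        elements.append(symbol)
        coords.append((float(x), float(y), float(z)))
    if len(elements) != n_atoms:
        raise ValueError(f"{path}: expected {n_atoms} atoms, found {len(elements)}")
    return elements, coords


def load_xyz_directory(directory):
    """Read every *.xyz file in sorted order; return names and structures."""
    paths = sorted(Path(directory).glob("*.xyz"))
    names = [p.stem for p in paths]  # file stems double as molecule names
    structures = [read_xyz(p) for p in paths]
    return names, structures
```

Sorting the paths makes the row order of any representation built from `structures` reproducible, and storing `names` alongside it allows the order to be checked later against other arrays.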
To re-train the models, we recommend that you regenerate the representations and labels from the raw data in the Materials Cloud record to minimize the chance of mismatching molecules and properties. To run inference, first generate the representations from the xyz files of the molecules you want to predict.
Because the representations are not provided, users must generate them with the provided scripts (`generate_slatm.py`). This carries a risk of shuffling the indices with respect to the given data arrays. To avoid this, use the provided `names` arrays to verify the ordering, or regenerate the label numpy arrays from the raw data in the Materials Cloud record. Be careful!
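One way to guard against such a mismatch is to compute, from the two lists of names, the index order that lines the labels up with the generated representations. The sketch below is only an illustration: `alignment_order` is a hypothetical helper, and it assumes that the entries of the `names` arrays can be compared as strings with the names of the xyz files used to build the representations.

```python
def alignment_order(label_names, repr_names):
    """Return indices that reorder the label rows to follow repr_names.

    Hypothetical helper: label_names is the order of the provided labels,
    repr_names is the order in which the representations were generated.
    """
    label_names = [str(n) for n in label_names]
    if len(set(label_names)) != len(label_names):
        raise ValueError("duplicate entries in label_names")
    position = {name: i for i, name in enumerate(label_names)}
    missing = [str(n) for n in repr_names if str(n) not in position]
    if missing:
        raise KeyError(f"{len(missing)} names have no label, e.g. {missing[0]}")
    return [position[str(n)] for n in repr_names]
```

With numpy arrays, the returned list can be used for fancy indexing, for example `labels = labels[alignment_order(names, repr_names)]`, after which row `i` of the labels and row `i` of the representations refer to the same molecule. If the two name lists are already in the same order, the result is simply `0, 1, 2, ...`.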
- `crosscoupler` contains the source code and an example of the cross-coupling tool, which can find suitable unique sp2 carbons in molecules and generate coupling products. The code is given as a Jupyter notebook. To run the notebook, you need to register the conda environment `FORMED` (vide supra) as a kernel by running `python -m ipykernel install --user --name=FORMED`. After that, you should be able to run the notebook normally by selecting the `FORMED` environment as the kernel. Example inputs are provided and pre-filled; the expected output is detailed in the notebook, and the runtime should be almost instantaneous.
- `cv` contains 10-fold cross-validation scripts for the XGBoost ML models, as well as the outputs of the scripts. To run them, first execute `generate_slatm.py`, pointing it to the xyz files (vide infra), to generate the `repr.npy` file containing the SLATM representations.
- `data` contains raw data as `numpy` arrays, as extracted from the TD-DFT computations. It also contains the script that generates the SLATM representations from the xyz files available in the Materials Cloud record and saves them as `repr.npy`. The same data is also available in the record.
- `models` contains the trained XGBoost models and learning curves. To re-run, first execute `generate_slatm.py`, pointing it to the xyz files (vide supra), to generate the `repr.npy` file containing the SLATM representations.
- `predict` contains the scripts for inference using the trained models. The SLATM representations of the dimer data can be generated with the given script from the xyz files available in the Materials Cloud record. The output of the predictions is also given. To run, first execute `generate_slatm.py`, pointing it to the xyz files of the molecules to predict, to generate the `repr.npy` file containing the SLATM representations.
- `substructures` contains the exact definition of the SMARTS keys used for substructure search.
- `no_dimers` contains the models and cross-validation scripts to test the effect of including (or not) the 2506 selected dimers in the training set using pretrained models.
- `data_no_dimers` contains analogous content to `data` without the 2506 selected dimers.
- `data_dimers` contains analogous content to `data` without the FORMED dataset.
- `select_dimers` contains the source code used to select a diverse subset of 2506 dimers. The code is given as a Jupyter notebook. To run the notebook, you need to register the conda environment `FORMED` (vide supra) as a kernel by running `python -m ipykernel install --user --name=FORMED`. After that, you should be able to run the notebook normally by selecting the `FORMED` environment as the kernel. Full functionality requires coupling site information obtained using the `crosscoupler`.
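
For readers unfamiliar with the 10-fold cross-validation mentioned for `cv`, the sketch below shows how the row indices of a dataset can be split into ten shuffled folds, each used once as the test set. It is a stand-in written with the standard library only; the scripts in `cv` may use a different splitter, shuffling, or seed, and the name `kfold_indices` is hypothetical.

```python
import random


def kfold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop
```

Each pair of index lists would then select the rows of the representation array and the matching labels for training and evaluation, so every molecule is used for testing exactly once.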