Chemperium

Portmanteau of the Latin words Chemia and Imperium: "a chemical empire".

Chemperium is a deep learning toolkit that aims to conquer the chemical space of compounds and properties. Its main focus is the applicability and accuracy of trained models. Although many publications, tools, and datasets on molecular property prediction already exist, Chemperium is aimed at both experts and non-experts in cheminformatics who need fast and accurate predictions.
The package provides a validated software tool and trained machine learning models for making reliable molecular property predictions with a minimum of code and time.

1. Installation

Chemperium is built upon NumPy, pandas, RDKit, TensorFlow, Keras, and scikit-learn. The package is installed from source with pip.

First create and activate a virtual environment in Anaconda, then clone the repository and install it:

conda create -n chemperium python=3.11
conda activate chemperium

git clone https://github.com/mrodobbe/chemperium.git
cd chemperium
pip install .
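
After installation, a quick check can confirm that Chemperium and the dependencies listed above are importable in the active environment. The following is a minimal sketch using only the standard library. The helper name check_environment is illustrative and not part of Chemperium, and the import names (for example sklearn for scikit-learn) are the usual ones for these packages.

```python
from importlib.util import find_spec

# Import names of Chemperium and the dependencies named in this README.
REQUIRED_MODULES = ("chemperium", "numpy", "pandas", "rdkit",
                    "tensorflow", "keras", "sklearn")


def check_environment(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Any module reported as MISSING points to an incomplete installation or to the wrong conda environment being active.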

2. Usage

Chemperium can be loaded as a Python package with import chemperium. It offers several ways to predict molecular properties or to train new models.

Predicting properties with chemperium

A distinction is made between liquid-phase properties and thermochemistry.

Liquid-phase properties

Liquid-phase properties are predicted with the class chemperium.training.predict.Liquid. It takes the target property, the dimension of the molecular information (2D or 3D), and the location of the trained models.

2D example for boiling point

import chemperium as cp

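# Replace <folder> with the path to the directory containing the trained models
model_folder = "<folder>"
bp_model = cp.Liquid("bp", "2d", model_folder)
prediction = bp_model.predict("COc1ccccc1")  # SMILES of anisole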

(currently supported properties: bp, tc, pc, vp, logp, logs)

Thermochemistry

Thermochemical properties are predicted in a similar way with the class chemperium.training.predict.Thermo. Separate functions predict the enthalpy of formation, the entropy of formation, and the Gibbs free energy of formation. Predictions can be made at temperatures between 298 K and 1500 K. When 3D predictions are chosen, Δ-machine learning is used and a lower level-of-theory estimate must be provided. Currently, all predictions are in kcal/mol for enthalpy and cal/mol/K for entropy.
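
Because enthalpy and entropy are reported in different energy units, combining them by hand requires a conversion. The relation G = H - T*S is standard thermodynamics and holds when H and S refer to the same temperature and reference state. The helper below is an illustrative sketch and not part of Chemperium, which offers its own function for Gibbs free energy predictions.

```python
def gibbs_kcal_per_mol(h_kcal_per_mol, s_cal_per_mol_k, temperature_k):
    """Gibbs free energy G = H - T*S in kcal/mol.

    H is given in kcal/mol and S in cal/mol/K, so S is divided by 1000
    to convert cal to kcal before it is combined with H.
    """
    return h_kcal_per_mol - temperature_k * s_cal_per_mol_k / 1000.0


if __name__ == "__main__":
    # Made-up numbers for illustration, not predictions for a real molecule.
    print(gibbs_kcal_per_mol(-10.0, 50.0, 1000.0))
```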

3D example for CBS-QB3

import chemperium as cp

smi = "COc1ccccc1"
xyz = '16\n' \
      '\n' \
      'C          2.76930        0.32250       -0.00050\n' \
      'O          1.76340       -0.67620       -0.00000\n' \
      'C          0.45600       -0.27750       -0.00000\n' \
      'C         -0.49220       -1.31180       -0.00020\n' \
      'C         -1.84930       -1.01160       -0.00010\n' \
      'C         -2.28360        0.31900        0.00010\n' \
      'C         -1.33830        1.34160        0.00020\n' \
      'C          0.03130        1.05620        0.00020\n' \
      'H          3.72200       -0.21080       -0.00090\n' \
      'H          2.71000        0.95720       -0.89500\n' \
      'H          2.71080        0.95730        0.89390\n' \
      'H         -0.13750       -2.33800       -0.00030\n' \
      'H         -2.57430       -1.82150       -0.00030\n' \
      'H         -3.34470        0.55070        0.00010\n' \
      'H         -1.65940        2.38020        0.00040\n' \
      'H          0.74700        1.87060        0.00030'
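llot = -5.13245  # lower level-of-theory estimate, required for 3D (Δ-ML) predictions

# Replace <folder> with the path to the directory containing the trained models
model_folder = "<folder>"
thermo = cp.Thermo("cbs-qb3", "3d", model_folder)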

# Predict the standard enthalpy of formation at 298 K
h298_prediction = thermo.predict_enthalpy(smi, xyz, llot, quality_check=True)

# Predict the Gibbs free energy at 1000 K
g1000_prediction = thermo.predict_gibbs(smi, xyz, llot, t=1000)

# Predict the thermochemistry in Chemkin format
chemkin_inp = thermo.get_nasa_polynomials("anisole", smi, xyz, llot, chemkin=True)
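
Chemkin thermochemistry files store NASA polynomials, which express heat capacity, enthalpy, and entropy as functions of temperature. The sketch below evaluates the widely used 7-coefficient form for a single temperature range. It is an assumption that the coefficients produced by get_nasa_polynomials follow this form, and the return type of that function is not documented here. The function nasa7 and the example coefficients are illustrative only.

```python
import math

R_CAL = 1.98720425864083  # gas constant in cal/mol/K


def nasa7(coeffs, temperature_k):
    """Evaluate a 7-coefficient NASA polynomial at one temperature.

    Returns (cp, h, s) with cp and s in cal/mol/K and h in kcal/mol.
    """
    a1, a2, a3, a4, a5, a6, a7 = coeffs
    t = temperature_k
    # Cp/R = a1 + a2*T + a3*T^2 + a4*T^3 + a5*T^4
    cp_over_r = a1 + a2 * t + a3 * t**2 + a4 * t**3 + a5 * t**4
    # H/(R*T) = a1 + a2*T/2 + a3*T^2/3 + a4*T^3/4 + a5*T^4/5 + a6/T
    h_over_rt = (a1 + a2 * t / 2 + a3 * t**2 / 3 + a4 * t**3 / 4
                 + a5 * t**4 / 5 + a6 / t)
    # S/R = a1*ln(T) + a2*T + a3*T^2/2 + a4*T^3/3 + a5*T^4/4 + a7
    s_over_r = (a1 * math.log(t) + a2 * t + a3 * t**2 / 2
                + a4 * t**3 / 3 + a5 * t**4 / 4 + a7)
    return cp_over_r * R_CAL, h_over_rt * R_CAL * t / 1000.0, s_over_r * R_CAL


if __name__ == "__main__":
    # Made-up coefficients for illustration, not a fit for a real molecule.
    print(nasa7([3.5, 0.0, 0.0, 0.0, 0.0, -1000.0, 4.0], 1000.0))
```

The units were chosen to match the ones stated in this section: kcal/mol for enthalpy and cal/mol/K for entropy and heat capacity.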

Training machine learning models

The Thermo and Liquid models are trained in advance. It is also possible to train your own models with the function chemperium.training.train.train. As the example below shows, it takes the location of a CSV file with training data, a list of target properties, the directory in which the trained models are saved, and (optionally) a dictionary of training arguments.

Training a 3D MPNN for prediction of logP and logS:

import chemperium as cp

csv_location = "examples/example_data.csv"
props = ["logp", "logs"]
save_dir = "examples/output"
input_args = {"rdf": True, 
              "cutoff": 2.1, 
              "num_layers": 3, 
              "hidden_size": 128, 
              "depth": 4}
cp.train(csv_location, props, save_dir, input_args)

Testing trained machine learning models

Models trained with the function chemperium.training.train.train can be used to predict properties with the function chemperium.training.test.test. Its usage closely resembles that of the train function and requires the following information:

smiles: a list of SMILES identifiers
prop: the target property or properties
save_dir: the folder where the trained models are stored
xyz: (optional) a list with the 3D coordinates of the target compounds
return_results: (optional) a bool that states whether the results should be returned as a DataFrame; defaults to False
input_args: (optional) a dictionary with the training arguments of the trained models

Testing a 3D MPNN for prediction of logP and logS:

import chemperium as cp

smi = ["COc1ccccc1"]
xyz = ['16\n' \
       '\n' \
       'C          2.76930        0.32250       -0.00050\n' \
       'O          1.76340       -0.67620       -0.00000\n' \
       'C          0.45600       -0.27750       -0.00000\n' \
       'C         -0.49220       -1.31180       -0.00020\n' \
       'C         -1.84930       -1.01160       -0.00010\n' \
       'C         -2.28360        0.31900        0.00010\n' \
       'C         -1.33830        1.34160        0.00020\n' \
       'C          0.03130        1.05620        0.00020\n' \
       'H          3.72200       -0.21080       -0.00090\n' \
       'H          2.71000        0.95720       -0.89500\n' \
       'H          2.71080        0.95730        0.89390\n' \
       'H         -0.13750       -2.33800       -0.00030\n' \
       'H         -2.57430       -1.82150       -0.00030\n' \
       'H         -3.34470        0.55070        0.00010\n' \
       'H         -1.65940        2.38020        0.00040\n' \
       'H          0.74700        1.87060        0.00030']
props = ["logp", "logs"]
save_dir = "examples/output"
input_args = {"rdf": True, 
              "cutoff": 2.1, 
              "num_layers": 3, 
              "hidden_size": 128, 
              "depth": 4}
# The fifth argument (True) is return_results, so the predictions come back as a DataFrame
results = cp.test(smi, props, save_dir, xyz, True, input_args)
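
The 3D coordinates above are given in the XYZ format: the first line holds the number of atoms, the second line is a comment, and each further line lists an element symbol followed by x, y, and z coordinates. A malformed string is easy to produce by hand, so a small check before prediction can help. The helper below is a hypothetical sketch using only the standard library, not a Chemperium function.

```python
def check_xyz(xyz):
    """Return the atom count of an XYZ string, or raise ValueError if malformed."""
    lines = xyz.split("\n")
    if len(lines) < 2:
        raise ValueError("expected an atom-count line and a comment line")
    n_atoms = int(lines[0].strip())  # raises ValueError if not an integer
    atom_lines = [ln for ln in lines[2:] if ln.strip()]
    if len(atom_lines) != n_atoms:
        raise ValueError(f"header says {n_atoms} atoms, found {len(atom_lines)}")
    for ln in atom_lines:
        fields = ln.split()
        if len(fields) != 4:
            raise ValueError(f"expected 'symbol x y z', got: {ln!r}")
        for value in fields[1:]:
            float(value)  # raises ValueError if a coordinate is not numeric
    return n_atoms


if __name__ == "__main__":
    # Approximate water geometry, used only to exercise the check.
    water = "3\n\nO 0.000 0.000 0.117\nH 0.000 0.757 -0.469\nH 0.000 -0.757 -0.469"
    print(check_xyz(water))
```

The check only covers the text layout. It does not tell whether the geometry is chemically sensible or matches the SMILES string.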

Creating a learned representation

3. Scripts

It is also possible to train and test models from the command line. The commands below reproduce the training and testing examples from the Usage section.

Training a model via command line

The script can be found in scripts/train.py.

python train.py --data "examples/example_data.csv" --save_dir "examples/output" --property "logp,logs" \
    --rdf --cutoff 2.1 --num_layers 3 --hidden_size 128 --depth 4

Testing a model via command line

python test.py --test_data "examples/example_test_data.csv" --save_dir "examples/output" --property "logp,logs" \
    --rdf --cutoff 2.1 --num_layers 3 --hidden_size 128 --depth 4
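
The command-line flags mirror the input_args dictionary used in the Python examples. The sketch below builds such a command from a dictionary. It assumes, based only on the commands shown above, that a True value becomes a bare flag (as with --rdf) and that every other entry becomes a flag followed by its value. build_command is a hypothetical helper and not part of Chemperium.

```python
import shlex


def build_command(script, data_flag, data_path, save_dir, props, input_args):
    """Return the argument list for a train.py or test.py call."""
    cmd = ["python", script, data_flag, data_path,
           "--save_dir", save_dir, "--property", ",".join(props)]
    for key, value in input_args.items():
        if value is True:
            cmd.append(f"--{key}")           # boolean switch, e.g. --rdf
        elif value is not False:
            cmd += [f"--{key}", str(value)]  # flag with value, e.g. --cutoff 2.1
        # Entries set to False are skipped (assumed to mean "switch off").
    return cmd


if __name__ == "__main__":
    args = {"rdf": True, "cutoff": 2.1, "num_layers": 3,
            "hidden_size": 128, "depth": 4}
    cmd = build_command("train.py", "--data", "examples/example_data.csv",
                        "examples/output", ["logp", "logs"], args)
    print(shlex.join(cmd))  # printable form; pass cmd to subprocess.run to execute
```

Keeping the arguments in one dictionary makes it easier to use identical settings for training and testing, which both commands above rely on.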

4. Tutorial

A small demo notebook is available in notebooks/demo.ipynb.

5. Datasets

All datasets are available in the Zenodo repository.

6. Reference

When using chemperium for your own work, please cite the original publication:
M. R. Dobbelaere, I. Lengyel, C. V. Stevens, and K. M. Van Geem, Geometric Deep Learning for Molecular Property Predictions with Chemical Accuracy Across Chemical Space, Submitted, 2024.

@ARTICLE{Dobbelaere2024,
  title     = "Geometric Deep Learning for Molecular Property Predictions with Chemical Accuracy Across Chemical Space",
  author    = "Dobbelaere, Maarten R and Lengyel, Istvan and Stevens,
               Christian V and Van Geem, Kevin M",
  journal   = "Submitted",
  year      =  2024,
  language  = "en"
}

Acknowledgments

This software tool has been developed with support from the Research Fund of Flanders (FWO-Vlaanderen, grant 1S45522N), the European Research Council (ERC grant 818607), and the European Union's Horizon Programme (grant 101057816, "TransPharm").
