This repository contains an implementation of the molecular graph deep sets (MolSets) model for molecular mixture properties, associated with our paper "Learning molecular mixture property using chemistry-aware graph neural network."
- `models.py` and `dmpnn.py` contain implementations of MolSets with standard graph convolutions and DMPNN, respectively.
- `main.py`, `main_dmpnn.py`, and `predict.py` are for evaluation and prediction; see Usage for details.
- `data_utils.py` is for processing molecular graph data.
- `data/` provides datasets used in the paper. Details on datasets:
  - `data_compiled.csv` contains cleaned raw data from the dataset curated in ACS Cent. Sci. 2023, 9, 2, 206–216.
  - `prepare_data.py` is for processing the raw data, e.g., converting SMILES to graphs.
  - `data_list.pkl` contains processed data from the dataset:
    - An integer index;
    - A list of solvent molecular graphs in `torch_geometric.data.Data` format;
    - A list of solvent molecular weights (g/mol);
    - A list of solvent weight fractions;
    - Salt molality (mol/kg);
    - Salt molecular graph;
    - Logarithm conductivity at 298 K (log S/cm).
  - `data_df_stats.pkl` organizes the data with some statistics in `pandas.DataFrame` format.
  - `all_bin_candidates.pkl` contains the candidates (equal weight binary molecular mixture + 1 m salt) for virtual screening. Organized in the same way as `data_list.pkl`.
- `results` provides model checkpoints and saves files generated in runs.
*Note: Git LFS is required to download the `.pkl` files properly. Please download them manually if you do not have Git LFS.
**Data handling is not yet optimized for efficiency. Contributions are welcome!
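As a rough illustration of the fields listed above for `data_list.pkl`, the sketch below unpacks pickled entries into named fields. It assumes the file holds a sequence of seven-element tuples in the listed order, which the description suggests but does not guarantee. `MixtureRecord` and `load_records` are hypothetical names, not part of this repository. Unpickling the real file also needs `torch_geometric` installed, since the graphs are stored as `torch_geometric.data.Data` objects.

```python
import pickle
from typing import Any, List, NamedTuple


class MixtureRecord(NamedTuple):
    """Hypothetical container; field order mirrors the list above (assumed)."""
    index: int
    solvent_graphs: List[Any]             # torch_geometric.data.Data objects
    solvent_mol_weights: List[float]      # g/mol
    solvent_weight_fractions: List[float]
    salt_molality: float                  # mol/kg
    salt_graph: Any                       # torch_geometric.data.Data object
    log_conductivity: float               # log S/cm at 298 K


def load_records(path: str) -> List[MixtureRecord]:
    """Load a pickled sequence of 7-element entries and name their fields."""
    with open(path, "rb") as f:
        raw = pickle.load(f)
    # Raises TypeError if an entry does not have exactly seven elements.
    return [MixtureRecord(*entry) for entry in raw]
```

If the layout assumption holds, `load_records("data/data_list.pkl")[0].salt_molality` would give the salt molality of the first entry.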
MolSets requires the following packages:
- PyTorch >= 2.0
- PyG (`torch_geometric`)
- PyTorch Scatter (only for DMPNN)
The environment can be set up by running:

    conda env create -f environment.yml

However, there may be package compatibility issues that need manual corrections. CUDA and GPU-enabled versions of PyTorch and PyG are required to run on GPUs.
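Since compatibility issues may need manual fixes, a quick check of which listed packages are importable in the active environment can help. The snippet below is a minimal sketch: `check_requirements` is a hypothetical helper, and the module names `torch`, `torch_geometric`, and `torch_scatter` are the usual import names for the packages listed above. It only tests importability, not versions or GPU support.

```python
from importlib.util import find_spec
from typing import Dict

# Display label -> import name (assumed to be the standard module names).
REQUIRED_MODULES = {
    "PyTorch": "torch",
    "PyG": "torch_geometric",
    "PyTorch Scatter (DMPNN only)": "torch_scatter",
}


def check_requirements(modules: Dict[str, str] = REQUIRED_MODULES) -> Dict[str, bool]:
    """Return, for each label, whether its module can be found without importing it."""
    return {label: find_spec(name) is not None for label, name in modules.items()}


if __name__ == "__main__":
    for label, available in check_requirements().items():
        print(f"{label}: {'found' if available else 'MISSING'}")
```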
Use `main.py` to train the MolSets model (with standard graph convolutions) or evaluate it on a dataset. Set the hyperparameters in `hyperpars` and the data path in `dataset`, then run

    (screen) python main.py

and see the results. Training may take minutes to hours depending on the device and data size. For the model with DMPNN, use `main_dmpnn.py` instead, following similar procedures.
Use `predict.py` to make inferences on candidate mixtures with a trained model. Specify the path to the candidate data file in `candidate_data` and the model checkpoint file in `model.load_`. Information about the training data is needed if feature normalization is used, as in `data_utils.py`.

After setup, run

    python predict.py

and the predictions will be written to a `.csv` file.
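The remark above about needing training-data information can be illustrated as follows. When features are standardized, the mean and standard deviation computed on the training set must be reused for the candidates, or the model sees inputs on a different scale than it was trained on. The sketch below shows this for a single scalar feature. It is a generic illustration with hypothetical helper names, not the routine in `data_utils.py`, whose normalization scheme is not described here.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def fit_stats(train_values: Sequence[float]) -> Tuple[float, float]:
    """Compute mean and population standard deviation on training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for a constant feature
    return mu, sigma


def normalize(values: Sequence[float], mu: float, sigma: float) -> List[float]:
    """Standardize values with statistics fitted on the training set."""
    return [(v - mu) / sigma for v in values]


# Statistics come from the training data, then are applied to new candidates.
train_feature = [1.0, 2.0, 3.0]
mu, sigma = fit_stats(train_feature)
candidate_feature = normalize([2.0, 4.0], mu, sigma)
```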
If you find this code useful, please consider citing the following paper:
@article{zhang2024molsets,
author = {Zhang, Hengrui and Lai, Tianxing and Chen, Jie and Manthiram, Arumugam and Rondinelli, James M. and Chen, Wei},
title = {Learning molecular mixture property using chemistry-aware graph neural network},
journal = {PRX Energy},
year = {2024},
volume = {3},
number = {2},
pages = {023006},
doi = {10.1103/PRXEnergy.3.023006}
}