Repository containing data, code, and a walkthrough for the methods in the paper *Representations of lipid nanoparticles using large language models for transfection efficiency prediction*.
We aim to design LNPs for the delivery of mRNA therapeutics. With this goal in mind, we define a successful LNP as one with the following attributes: biodegradable, non-toxic, synthesizable, stable (across pH and temperature), and (most importantly) efficient at transfection. We design predictive models that estimate these qualities, with the intention of using them to screen and select the best candidates for experimental testing and development.
We reproduce the results of Ding et al., 2023 in this repository and compare how our methods perform on this benchmark dataset.
Dependency management is done via poetry.
pip install poetry
pip install tensorflow # tensorflow is installed separately
poetry install
We organize our code into the following structure:
data
: contains all data used in our experiments. This includes CSVs with lipid SMILES and properties, and corresponding fingerprint JSON files from different embedding methods. Those ending in `_alldata` correspond to SMILES from Ding et al.

src
: contains the source code for downloading data from Ding et al., splitting data for multiclass classification, running the tournament script for classification, and some helpers.

notebooks
: contains notebooks for data exploration, as well as GCN training and embedding extraction, and downstream classification with different methods.
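As a minimal sketch of how these files are consumed downstream (the file and column names below are hypothetical; substitute the actual CSV/JSON pair you want to use):

```python
import json

import pandas as pd

# Hypothetical file names -- substitute the CSV/JSON pair you want to use.
lipids = pd.read_csv("data/lipids_alldata.csv")  # SMILES plus measured properties
with open("data/megamolbart_fingerprints_alldata.json") as f:
    fingerprints = json.load(f)  # maps each SMILES string to its embedding vector

# Build a feature matrix aligned with the labels for downstream classification.
X = [fingerprints[s] for s in lipids["smiles"]]
y = lipids["transfection_efficiency"]
```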
We found that the best-performing model for predicting LNP transfection efficiency relies on embedding LNPs with a large language model. Below, we outline our method for finetuning NVIDIA's MegaMolBART model on the SwissLipids dataset.
- Download, install, and set up the pre-trained MegaMolBART model. We recommend installing via the container.
- Download the SwissLipids dataset
- Edit the model configuration file (i.e. `megamolbart_pretrain_base.yaml`) so that:
  - `restore_from_path` points to the pre-trained MegaMolBART `.nemo` file
  - the `trainer` and `batch_size` params are adjusted to your system settings (number and memory of GPUs)
  - `dataset_path` is set to the location of the downloaded SwissLipids dataset
- Edit the SwissLipids dataset so that it conforms to the data model of the MegaMolBART example dataset, and split the data into 99% training, 0.5% test, and 0.5% validation (a minimal split sketch is given after this list)
- Run the pre-training script with this configuration (instructions here)
- Stop training once `validation_molecular_accuracy` converges
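As a rough illustration of the dataset-preparation step above, here is a minimal sketch of the 99/0.5/0.5 split. It assumes the example dataset's layout of one directory per split containing a CSV of SMILES, and the SwissLipids column name and file paths below are guesses; check the header of your download and the example data shipped with the container before using it.

```python
from pathlib import Path

import pandas as pd

# Hypothetical locations -- adjust to where you downloaded/extracted the data.
SWISSLIPIDS_TSV = Path("data/swisslipids.tsv")
OUT_DIR = Path("data/swisslipids_megamolbart")
SMILES_COL = "SMILES (pH7.3)"  # assumption: check the header of your download

# Keep only the SMILES column and drop entries without a structure.
df = pd.read_csv(SWISSLIPIDS_TSV, sep="\t", usecols=[SMILES_COL])
df = df.rename(columns={SMILES_COL: "smiles"}).dropna()

# Shuffle, then split 99% train / 0.5% validation / 0.5% test.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
n_val = n_test = int(0.005 * len(df))
splits = {
    "val": df.iloc[:n_val],
    "test": df.iloc[n_val : n_val + n_test],
    "train": df.iloc[n_val + n_test :],
}

# One directory per split, mirroring the MegaMolBART example dataset layout
# (adjust the file naming if your container version expects something else).
for name, split in splits.items():
    split_dir = OUT_DIR / name
    split_dir.mkdir(parents=True, exist_ok=True)
    split.to_csv(split_dir / "x000.csv", index=False)
```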