This repository contains algorithms for QSAR modelling developed in the group of Dr. Sophia Tsoka at King's College London.
Our most recent paper, Optimal Piecewise Regression Algorithm for QSAR Modelling, has been published in Wiley's Molecular Informatics and can be found here.
In this study, we developed an algorithm based on mathematical programming to build QSAR models. The algorithm is written in Python and can be applied to any regression analysis. Details on how to download and use the oplrareg package can be found at https://github.com/KISysBio/OPLRAreg
This repository contains the data files used in the study.
We tested OPLRAreg on 5 data sets downloaded from ChEMBL:
- rDHFR: Dihydrofolate reductase (Rattus norvegicus)
- hDHFR: Dihydrofolate reductase (Homo sapiens)
- CHRM3: Muscarinic acetylcholine receptor M3
- NPYR1: Neuropeptide Y receptor type 1
- NPYR2: Neuropeptide Y receptor type 2
First, we have downloaded data about these data sets from ChEMBL, selecting the compounds which had their inhibitory activity against the common protein targets shown above measured by the
Then, we preprocessed the data by merging entries of duplicated molecules and filtering out entries whose activities could not be fully trusted, based on the Data Validity column from ChEMBL.
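The merging and filtering step can be sketched roughly as follows. This is only an illustration in plain Python, not the R code used in the study. The field names (`smiles`, `activity`, `data_validity`), the helper `preprocess`, and the choice of the median to merge duplicates are all assumptions made for the example. It also assumes that an empty Data Validity value means ChEMBL raised no concern about the entry.

```python
from statistics import median


def preprocess(records):
    """Drop flagged entries, then merge duplicated molecules.

    Assumptions (not taken from the study): each record is a dict with the
    hypothetical keys 'smiles', 'activity' and 'data_validity', and
    duplicates are merged by taking the median activity.
    """
    # Keep only entries with an empty Data Validity field.
    trusted = [r for r in records if not r["data_validity"]]

    # Group the remaining activities by molecule.
    grouped = {}
    for r in trusted:
        grouped.setdefault(r["smiles"], []).append(r["activity"])

    # Collapse each group of duplicates into a single value.
    return {smiles: median(values) for smiles, values in grouped.items()}


# Made-up records, for illustration only.
records = [
    {"smiles": "CCO", "activity": 5.0, "data_validity": ""},
    {"smiles": "CCO", "activity": 7.0, "data_validity": ""},
    {"smiles": "CCN", "activity": 6.2, "data_validity": "Outside typical range"},
    {"smiles": "c1ccccc1", "activity": 4.5, "data_validity": ""},
]

cleaned = preprocess(records)
print(cleaned)  # {'CCO': 6.0, 'c1ccccc1': 4.5}
```

Filtering before merging keeps a flagged measurement from influencing the merged value of a molecule that also has trusted measurements.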
The next step was to generate 200+ molecular descriptors using the Java Chemistry Development Kit (CDK) from within R (via the great rcdk package).
A list of all molecular descriptors can be found in the CDK documentation here.
The spreadsheets under the data directory are the final files produced by this preprocessing step.
At the beginning of the project, I (Jonathan) was not very familiar with Python. R was my default language for programming and data exploration, and we were using GAMS as the programming language for developing the algorithm.
Later it became clearer that we could create the same algorithm in Python using Pyomo, which would be much more accessible since not everyone owns a GAMS license. It was only then that we ported part of the project to Python.
Because of that, parts of the pipeline, such as molecular descriptor generation and data preprocessing, were implemented in R. Cross-validation and the comparison to other machine learning algorithms were also implemented in R to take advantage of the functionality of the caret package.
We will update this repository to include the preprocessing steps as well.
I am currently working on translating everything to Python and we will possibly use RDKit for most of the cheminformatics tasks.
We split the data sets at random into 5 different test studies. In each test, 75% of the samples were used to build and select the regression models under cross-validation, while the remaining 25% were used as an external validation set.
We wanted to compare algorithms using the same cross-validation folds and external set samples, so the files under the data_splits directory contain the indices of all tests. This includes the indices for the 10 batches of 10-fold cross-validation performed on the internal set of samples.
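To make the layout of these indices concrete, here is a minimal sketch of how such a scheme could be generated. It is not the code that produced the files in this repository (that was done in R with caret). The function `make_splits`, its parameters, the seed, and the dictionary layout are hypothetical, and the sketch assumes the 5 tests are independent random 75/25 splits.

```python
import random


def make_splits(n_samples, n_tests=5, internal_frac=0.75,
                n_repeats=10, n_folds=10, seed=0):
    """Build index sets for repeated hold-out tests with repeated k-fold CV.

    Hypothetical helper: each test shuffles all sample indices, keeps
    `internal_frac` of them as the internal set, and holds out the rest as
    the external validation set. The internal set is then partitioned into
    `n_folds` folds, `n_repeats` times, with a fresh shuffle each time.
    """
    rng = random.Random(seed)  # fixed seed so the indices are reproducible
    tests = []
    for _ in range(n_tests):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        n_internal = round(internal_frac * n_samples)
        internal = indices[:n_internal]
        external = indices[n_internal:]

        batches = []
        for _ in range(n_repeats):
            shuffled = internal[:]
            rng.shuffle(shuffled)
            # Every n_folds-th element goes to the same fold, so fold sizes
            # differ by at most one sample.
            folds = [shuffled[k::n_folds] for k in range(n_folds)]
            batches.append(folds)

        tests.append({"internal": internal,
                      "external": external,
                      "cv": batches})
    return tests


splits = make_splits(n_samples=100)
first = splits[0]
print(len(first["internal"]), len(first["external"]))  # 75 25
```

Storing the indices once and reusing them for every algorithm means that differences in performance cannot be attributed to a lucky or unlucky partition of the data.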
1 Using
- Sophia Tsoka: sophia.tsoka@kcl.ac.uk
- Jonathan Cardoso-Silva: jonathan.silva@kcl.ac.uk