Info

This repository contains algorithms for QSAR modelling developed in the group of Dr. Sophia Tsoka at King's College London.

Our most recent paper, Optimal Piecewise Regression Algorithm for QSAR Modelling, was published in Wiley's Molecular Informatics and can be found here.

In this study, we developed an algorithm based on mathematical programming to build QSAR models. The algorithm is written in Python and can be applied to any regression analysis. Details on how to download and use the oplrareg package are available at https://github.com/KISysBio/OPLRAreg

This repository contains the data files used in the study.

QSAR data sets used in this study

We tested OPLRAreg on 5 data sets downloaded from ChEMBL:

rDHFR: Dihydrofolate reductase (Rattus norvegicus)
hDHFR: Dihydrofolate reductase (Homo sapiens)
CHRM3: Muscarinic acetylcholine receptor M3
NPYR1: Neuropeptide Y receptor type 1
NPYR2: Neuropeptide Y receptor type 2

Preprocessing: Where do these data files come from?

First, we downloaded these data sets from ChEMBL, selecting the compounds whose inhibitory activity against the protein targets listed above was measured by the $IC_{50}$ metric ¹.

Then, we preprocessed the data by merging entries for duplicated molecules and filtering out entries whose activities could not be entirely trusted, using the Data Validity column from ChEMBL.
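The two preprocessing operations described above can be illustrated with a minimal sketch. It is not the code used in the study (that part of the pipeline was written in R). The record keys (`smiles`, `ic50_nM`, `data_validity`) are hypothetical names chosen for illustration, and merging duplicates by their median activity is an assumption, since the merging rule is not stated here.

```python
from collections import defaultdict
from statistics import median


def preprocess(records):
    """Drop flagged entries, then merge duplicated molecules.

    Each record is a dict with hypothetical keys 'smiles', 'ic50_nM' and
    'data_validity' (empty or None when no concern was raised).
    """
    grouped = defaultdict(list)
    for rec in records:
        # Any non-empty validity comment means the activity is not fully trusted.
        if rec.get("data_validity"):
            continue
        grouped[rec["smiles"]].append(rec["ic50_nM"])
    # Assumption: duplicated molecules are merged by taking the median activity.
    return {smiles: median(values) for smiles, values in grouped.items()}


# Toy input: two entries for the same molecule, one flagged entry, one clean entry.
records = [
    {"smiles": "CCO", "ic50_nM": 100.0, "data_validity": ""},
    {"smiles": "CCO", "ic50_nM": 300.0, "data_validity": None},
    {"smiles": "CCN", "ic50_nM": 50.0, "data_validity": "Outside typical range"},
    {"smiles": "CCC", "ic50_nM": 20.0, "data_validity": ""},
]
merged = preprocess(records)
```

Filtering before merging keeps a flagged measurement from distorting the merged value of an otherwise trusted molecule.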

The next step was to generate 200+ molecular descriptors using the Java Chemistry Development Kit (CDK) from within R (via the great rcdk package).

A list of all molecular descriptors can be found in the CDK documentation here.

The spreadsheets under the data directory are the final files produced by this preprocessing step.

Why did you not just stick with Python and use RDKit as the cheminformatics library?

At the beginning of the project, I (Jonathan) was not very familiar with Python. R was my default language for programming and data exploration and we were using GAMS as the programming language for developing the algorithm.

Later it became clear that we could implement the same algorithm in Python using Pyomo, which would be much more accessible since not everyone owns a GAMS license. It was only then that we ported part of the project to Python.

Because of that, parts of the pipeline, such as molecular descriptor generation and data preprocessing, were implemented in R. Cross-validation and the comparison to other machine learning algorithms were also implemented in R to take advantage of the caret package.

We will update this repository to include the preprocessing steps as well.

I am currently working on translating everything to Python and we will possibly use RDKit for most of the cheminformatics tasks.

What are the data splits?

We randomly split the data sets 5 times, giving 5 different tests. In each test, 75% of the samples were used to build and select the regression models under cross-validation, while the remaining 25% were used as an external validation set.

We wanted to compare algorithms using the same cross-validation folds and external set samples, so the files under the data_splits directory contain the indices for all tests. This includes the indices for the 10 batches of 10-fold cross-validation performed on the internal set of samples.
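To make the splitting scheme concrete, here is a minimal sketch of how such index sets could be generated. It is an illustration only: the original splits were produced in R, and the function name `make_splits`, its parameters, the seed, and the returned layout are all hypothetical and do not describe the format of the files in data_splits.

```python
import random


def make_splits(n_samples, n_tests=5, internal_frac=0.75,
                n_repeats=10, n_folds=10, seed=0):
    """Return index sets for repeated hold-out tests with repeated k-fold CV."""
    rng = random.Random(seed)  # fixed seed so every algorithm sees the same splits
    n_internal = round(internal_frac * n_samples)
    splits = []
    for _ in range(n_tests):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        internal = indices[:n_internal]   # used for model building and selection
        external = indices[n_internal:]   # held out for external validation
        repeats = []
        for _ in range(n_repeats):
            shuffled = internal[:]
            rng.shuffle(shuffled)
            # Fold k takes every n_folds-th sample of the shuffled internal set.
            repeats.append([shuffled[k::n_folds] for k in range(n_folds)])
        splits.append({"internal": internal, "external": external, "cv": repeats})
    return splits


splits = make_splits(n_samples=200)
```

Storing indices rather than the data itself is what lets different algorithms, even in different languages, be evaluated on identical folds and external samples.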

1 Using $IC_{50}$ as the sole measure of effectiveness was a decision made early in the project. The same pipeline would work perfectly well with other metrics such as $K_i$, or by combining the log of concentrations measured with $K_i$ and $IC_{50}$.

Contact

Sophia Tsoka: sophia.tsoka@kcl.ac.uk
Jonathan Cardoso-Silva: jonathan.silva@kcl.ac.uk
