Graph Neural Networks for Drug Response Prediction in Cancer 🧬

💡 Introduction

This repository contains the process and the final code for my master's thesis "Gene-Interaction Graph Neural Network to Predict Cancer Drug Response".

💻 Environment Setup

Using conda

To create the virtual environment via conda run

# Option 1: by using the environment.yml file (recommended).
conda env create -n ENVNAME --file environment.yml

# Option 2: by using the requirements.txt file.
conda create -n ENVNAME --file requirements.txt

Now activate the environment.

conda activate ENVNAME

Using pip

This may not work yet; only the conda method has been tested so far.

To create and activate the virtual environment via pip run

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

⏬ Download Raw Datasets

To download the raw datasets which can be used to create the training datasets run

from pathlib import Path
from src.preprocess.build_features import Processor

# Choose these two paths yourself; the values below are only examples.
RAW_PATH = '../../datatest/raw/'
PROCESSED_PATH = '../../datatest/processed/'

Path(RAW_PATH).mkdir(parents=True, exist_ok=True)
Path(PROCESSED_PATH).mkdir(parents=True, exist_ok=True)

processor = Processor(raw_path=RAW_PATH,
                      processed_path=PROCESSED_PATH)

processor.download_raw_datasets()

For raw_path and processed_path, we recommend choosing a folder outside of this repository, since some files are very large (>100MB).

For details on the dataset sizes and contents read data/README.md.
For an example of how to run this code refer to notebooks/download_raw_datasets.ipynb.
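
Since some raw files exceed 100MB, it can help to check which downloaded files are that large before deciding where to keep them. The following is a minimal sketch and not part of this repository: the helper name `find_large_files` and the constant `SIZE_LIMIT_BYTES` are hypothetical, and the 100MB limit is simply taken from the note above.

```python
from pathlib import Path

# 100 MB, the size mentioned above (hypothetical constant, not from the repository)
SIZE_LIMIT_BYTES = 100 * 1024 * 1024


def find_large_files(folder, limit_bytes=SIZE_LIMIT_BYTES):
    """Return (path, size_in_bytes) for every file under `folder` above the limit."""
    large = []
    for path in Path(folder).rglob('*'):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                large.append((path, size))
    # Largest files first
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Calling `find_large_files(RAW_PATH)` after the download returns the oversized files in descending order of size.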

🏃 How To Run

Several parameters can be set when running main.py. An example call looks like this:

python3 main.py \
    --seed=42 \
    --batch_size=1000 \
    --lr=0.0001 \
    --train_ratio=0.8 \
    --val_ratio=0.5 \
    --num_epochs=2 \
    --num_workers=8 \
    --dropout=0.1

All supported arguments are listed below:

usage: 
  main.py [--seed] [--batch_size] [--lr] [--train_ratio] [--val_ratio] [--num_epochs] 
          [--num_workers] [--dropout] [--model] [--version] [--download] [--process] 
          [--raw_path] [--processed_path] [--combined_score_thresh] [--gdsc]

optional arguments:
  --seed                    the random seed (for reproducibility)
  --batch_size              the size of each batch
  --lr                      learning rate
  --train_ratio             train set ratio
  --val_ratio               validation set ratio. (1-val_ratio) will be the test set ratio
  --num_epochs              number of epochs
  --num_workers             number of workers for DataLoader
  --dropout                 dropout probability
  --model                   name of the model to use
  --version                 model version to use
  --download                if enabled, the raw data will be downloaded and saved in the 
                            raw path
  --process                 if enabled, the data in the raw path will be processed and 
                            saved in the processed path
  --raw_path                path to the raw datasets
  --processed_path          path to the processed datasets
  --combined_score_thresh   threshold below which to cut off the gene-gene interactions
  --gdsc                    the type of GDSC database to use for training
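
One way to read `--train_ratio` and `--val_ratio` together is that the training share is taken first and `--val_ratio` then divides the remaining samples into validation and test. The sketch below works through that arithmetic. It is an interpretation of the descriptions above, not the code in main.py: the function name `split_sizes` is hypothetical, and the actual rounding and shuffling may differ.

```python
def split_sizes(n_samples, train_ratio=0.8, val_ratio=0.5):
    """Return (n_train, n_val, n_test) under the assumed two-step split."""
    # Step 1: the training set takes its share of all samples.
    n_train = int(n_samples * train_ratio)
    n_holdout = n_samples - n_train
    # Step 2: val_ratio splits the holdout; the rest (1 - val_ratio) is the test set.
    n_val = int(n_holdout * val_ratio)
    n_test = n_holdout - n_val
    return n_train, n_val, n_test


print(split_sizes(1000))  # (800, 100, 100)
```

With the values from the example call (`train_ratio=0.8`, `val_ratio=0.5`), this reading gives an 80/10/10 split.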

Argument	Default	Options	Description	Notes
`--download`	`n`	`{n, y}`	Whether to download the raw datasets.	This only needs to be done once.
`--raw_path`	`../data/raw/`	Any path	Path to save the raw datasets to.	The default lies outside this repository because the raw files are very large.
`--processed_path`	`../data/raw/`	Any path	Path to save the processed datasets to.	The default also lies outside this repository, so that a shared `data` folder holds both the raw and the processed datasets.
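
For orientation, the arguments above could be declared with Python's standard `argparse` module roughly as follows. This is a sketch under assumptions, not the parser in main.py: only the defaults for `--download`, `--raw_path` and `--processed_path` come from the table above. The other defaults are placeholders borrowed from the example call, and the `{n, y}` choices for `--process` and the integer type of `--combined_score_thresh` are guesses.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments (assumed types and defaults)."""
    p = argparse.ArgumentParser(prog='main.py')
    # Placeholder defaults taken from the example call, not confirmed by main.py.
    p.add_argument('--seed', type=int, default=42, help='the random seed')
    p.add_argument('--batch_size', type=int, default=1000, help='the size of each batch')
    p.add_argument('--lr', type=float, default=0.0001, help='learning rate')
    p.add_argument('--train_ratio', type=float, default=0.8, help='train set ratio')
    p.add_argument('--val_ratio', type=float, default=0.5, help='validation set ratio')
    p.add_argument('--num_epochs', type=int, default=2, help='number of epochs')
    p.add_argument('--num_workers', type=int, default=8, help='workers for DataLoader')
    p.add_argument('--dropout', type=float, default=0.1, help='dropout probability')
    # No defaults are documented for these, so they are left unset here.
    p.add_argument('--model', default=None, help='name of the model to use')
    p.add_argument('--version', default=None, help='model version to use')
    p.add_argument('--combined_score_thresh', type=int, default=None,
                   help='threshold below which to cut off gene-gene interactions')
    p.add_argument('--gdsc', default=None, help='the type of GDSC database to use')
    # Defaults below are the ones listed in the table.
    p.add_argument('--download', choices=['n', 'y'], default='n',
                   help='download the raw data into the raw path')
    p.add_argument('--process', choices=['n', 'y'], default='n',  # choices assumed
                   help='process the raw data into the processed path')
    p.add_argument('--raw_path', default='../data/raw/', help='path to the raw datasets')
    p.add_argument('--processed_path', default='../data/raw/',
                   help='path to the processed datasets')
    return p


args = build_parser().parse_args(['--seed', '7', '--download', 'y'])
print(args.seed, args.download, args.raw_path)
```

A parser declared this way accepts both `--seed 7` and `--seed=7`.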

📚 Contents

Notebooks

Notebook	Content
`02_GDSC_map_GenExpr.ipynb`	Contains the code for the creation of the base dataset containing gene expressions for cell-line drug combinations.
`03_GDSC_map_CNV.ipynb`	Contains the code for the creation of the base dataset containing gistic and picnic copy numbers for cell-line drug combinations.
`04_GDSC_map_mutations.ipynb`
`05_DrugFeatures.ipynb`
`06_create_base_dataset.ipynb`
`07_Linking.ipynb`	Contains the code for the creation of the graph using the STRING database.
`07_v1_2_get_linking_dataset.ipynb`	Creates the graph per cell-line with all 4 node features (gene expr, cnv gistic, cnv picnic and mutation). Topology per cell-line graph is `Data(x=[858, 4], edge_index=[2, 83126])`.
`07_v2_graph_dataset.ipynb`	Used only the gene-gene tuples with a `combined_score` value of more than 950. Ended up with only 458 genes per cell-line for now (instead of 858 as before). Filtered first by the `combined_score` and then by the landmark genes. Topology per cell-line graph is `Data(x=[458, 4], edge_index=[2, 4760])`.
`07_v3_graph_dataset.ipynb`	First select only the landmark genes from the protein-protein interaction table. Then tune the threshold for the `combined_score` column according to how many unique genes would be left.
`11_v1_GraphTab_sparse_1.ipynb`	Used the dataset from `07_v2_graph_dataset.ipynb`, which has a per-cell-line graph topology of `Data(x=[458, 4], edge_index=[2, 4760])`.
`11_v1_GraphTab_nonsparse.ipynb`	Used the dataset from `07_v1_2_get_linking_dataset.ipynb`, which has a per-cell-line graph topology of `Data(x=[858, 4], edge_index=[2, 83126])`. Each epoch took too long, which is why other approaches were needed to reduce the number of edges in the graph (see the `combined_score` approach).
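
The `combined_score` filtering described for the `07_v2` and `07_v3` notebooks can be sketched in plain Python as follows. This is an illustration on toy data, not the notebook code: the function names, the example scores and the placeholder gene `GENE_X` are made up, and the notebooks may well use pandas and PyTorch Geometric instead.

```python
def filter_edges(edges, landmark_genes, score_thresh):
    """Keep edges scoring above the threshold whose two genes are both landmark genes."""
    return [(a, b) for a, b, score in edges
            if score > score_thresh and a in landmark_genes and b in landmark_genes]


def build_edge_index(edges):
    """Map gene names to node indices and return (genes, [sources, targets])."""
    genes = sorted({gene for edge in edges for gene in edge})
    gene_to_idx = {gene: i for i, gene in enumerate(genes)}
    sources = [gene_to_idx[a] for a, _ in edges]
    targets = [gene_to_idx[b] for _, b in edges]
    # Note: for undirected graphs, PyTorch Geometric usually lists both
    # directions of each edge; that step is omitted in this sketch.
    return genes, [sources, targets]


def genes_left_per_threshold(edges, landmark_genes, thresholds):
    """Count the unique genes that remain for each candidate threshold."""
    counts = {}
    for thresh in thresholds:
        kept = filter_edges(edges, landmark_genes, thresh)
        counts[thresh] = len({gene for edge in kept for gene in edge})
    return counts


# Toy gene-gene interactions as (gene_a, gene_b, combined_score)
toy_edges = [('TP53', 'MDM2', 999), ('TP53', 'EGFR', 700),
             ('EGFR', 'KRAS', 960), ('KRAS', 'GENE_X', 990)]
toy_landmarks = {'TP53', 'MDM2', 'EGFR', 'KRAS'}

kept = filter_edges(toy_edges, toy_landmarks, score_thresh=950)
genes, edge_index = build_edge_index(kept)
print(genes)       # ['EGFR', 'KRAS', 'MDM2', 'TP53']
print(edge_index)  # [[3, 0], [2, 1]]
print(genes_left_per_threshold(toy_edges, toy_landmarks, [600, 950, 995]))
# {600: 4, 950: 4, 995: 2}
```

The last call mirrors the idea behind `07_v3_graph_dataset.ipynb`: scan candidate thresholds and pick one based on how many unique genes would be left.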

📆 Todos

Networks:

👀 Questions

How to use the given RMSE in the drug-response matrix?
How can I improve the performance of the cell-GNN models? More genes (but then the runtime becomes too long)?
Why is the cell-line graph approach overfitting so much?

PeeteKeesel/gnn-for-drug-response-prediction
