🧬 Process repository for my master's thesis on drug response prediction for cancer cell-lines using bi-modal graph neural networks

PeeteKeesel/gnn-for-drug-response-prediction

Graph Neural Networks for Drug Response Prediction in Cancer 🧬

Example One-Hop Neighbor Plot for Gene Symbol E2F2

💡 Introduction

This repository contains the working process and the final code for my master's thesis "Gene-Interaction Graph Neural Network to Predict Cancer Drug Response".

💻 Environment Setup

Using conda

To create the virtual environment via conda, run:

# Option 1: by using the environment.yml file (recommended).
conda env create -n ENVNAME --file environment.yml

# Option 2: by using the requirements.txt file.
conda create -n ENVNAME --file requirements.txt

Now activate the environment.

conda activate ENVNAME

Using pip

  • This may not work yet; I have only tested the conda method so far.

To create and activate the virtual environment via pip, run:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

⏬ Download Raw Datasets

To download the raw datasets used to build the training datasets, run:

from pathlib import Path
from src.preprocess.build_features import Processor

# Choose these two paths yourself; the values below are only examples.
RAW_PATH = '../../datatest/raw/'
PROCESSED_PATH = '../../datatest/processed/'

Path(RAW_PATH).mkdir(parents=True, exist_ok=True)
Path(PROCESSED_PATH).mkdir(parents=True, exist_ok=True)

processor = Processor(raw_path=RAW_PATH,
                      processed_path=PROCESSED_PATH)

processor.download_raw_datasets()

For raw_path and processed_path we recommend choosing a folder outside of this repository, since some files are very large (>100MB).

πŸƒ How To Run

Several parameters can be set when running main.py. An example call looks like this:

python3 main.py \
    --seed 42 \
    --batch_size 1000 \
    --lr 0.0001 \
    --train_ratio 0.8 \
    --val_ratio 0.5 \
    --num_epochs 2 \
    --num_workers 8 \
    --dropout 0.1

All supported arguments are listed below:

usage: 
  main.py [--seed] [--batch_size] [--lr] [--train_ratio] [--val_ratio] [--num_epochs] 
          [--num_workers] [--dropout] [--model] [--version] [--download] [--process] 
          [--raw_path] [--processed_path] [--combined_score_thresh] [--gdsc]

optional arguments:
  --seed                    the random seed (for reproducibility)
  --batch_size              the size of each batch
  --lr                      learning rate
  --train_ratio             train set ratio
  --val_ratio               validation set ratio. (1-val_ratio) will be the test set ratio
  --num_epochs              number of epochs
  --num_workers             number of workers for DataLoader
  --dropout                 dropout probability
  --model                   name of the model to use
  --version                 model version to use
  --download                if enabled, the raw data will be downloaded and saved in the 
                            raw path
  --process                 if enabled, the data in the raw path will be processed and 
                            saved in the processed path
  --raw_path                path to the raw datasets
  --processed_path          path to the processed datasets
  --combined_score_thresh   threshold below which to cut off the gene-gene interactions
  --gdsc                    the type of GDSC database to use for training

| Argument | Default | Options | Description | Notes |
| --- | --- | --- | --- | --- |
| --download | n | {n, y} | Download the raw datasets. | This only needs to be done once. |
| --raw_path | ../data/raw/ | Any path | Path to save the raw datasets to. | The default lies outside of this repository because the raw files are very large. |
| --processed_path | ../data/raw/ | Any path | Path to save the processed datasets to. | The default also lies outside of this repository, so that a single data folder contains both raw and processed datasets. |
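
The arguments listed above could be wired up with Python's standard argparse module. The following is a minimal sketch under stated assumptions, not the repository's actual main.py: it covers only a subset of the flags, the defaults for --download and --raw_path come from the notes above, and the remaining defaults are simply copied from the example call. The helper split_sizes is hypothetical, and so is its reading of --val_ratio as the share of the non-training samples that goes to validation.

```python
import argparse
from typing import Tuple


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a subset of the flags listed above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--batch_size", type=int, default=1000)
    parser.add_argument("--lr", type=float, default=0.0001)
    parser.add_argument("--train_ratio", type=float, default=0.8)
    parser.add_argument("--val_ratio", type=float, default=0.5)
    parser.add_argument("--num_epochs", type=int, default=2)
    parser.add_argument("--download", choices=["n", "y"], default="n")
    parser.add_argument("--raw_path", default="../data/raw/")
    return parser


def split_sizes(n_samples: int, train_ratio: float, val_ratio: float) -> Tuple[int, int, int]:
    """Return (n_train, n_val, n_test).

    Assumption: val_ratio is applied to the samples left after the training
    split, and whatever remains after that becomes the test set.
    """
    n_train = int(n_samples * train_ratio)
    n_val = int((n_samples - n_train) * val_ratio)
    n_test = n_samples - n_train - n_val
    return n_train, n_val, n_test


# Parse an explicit argument list instead of sys.argv to keep the demo deterministic.
args = build_parser().parse_args(["--train_ratio", "0.8", "--val_ratio", "0.5"])
print(split_sizes(10_000, args.train_ratio, args.val_ratio))
```

Under this reading, the example call with train_ratio=0.8 and val_ratio=0.5 would split the data roughly 80/10/10 into training, validation, and test sets.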

📚 Contents

Notebooks

| Notebook | Content |
| --- | --- |
| 02_GDSC_map_GenExpr.ipynb | Creates the base dataset containing gene expressions for cell-line drug combinations. |
| 03_GDSC_map_CNV.ipynb | Creates the base dataset containing GISTIC and PICNIC copy numbers for cell-line drug combinations. |
| 04_GDSC_map_mutations.ipynb | |
| 05_DrugFeatures.ipynb | |
| 06_create_base_dataset.ipynb | |
| 07_Linking.ipynb | Creates the graph using the STRING database. |
| 07_v1_2_get_linking_dataset.ipynb | Creates one graph per cell-line with all 4 node features (gene expression, CNV GISTIC, CNV PICNIC, and mutation). The topology of each cell-line graph is Data(x=[858, 4], edge_index=[2, 83126]). |
| 07_v2_graph_dataset.ipynb | Uses only the gene-gene tuples with a combined_score of more than 950. This leaves only 458 genes per cell-line for now (instead of 858 before). Filters first by combined_score and then by the landmark genes. The topology of each cell-line graph is Data(x=[458, 4], edge_index=[2, 4760]). |
| 07_v3_graph_dataset.ipynb | First selects only the landmark genes from the protein-protein interaction table, then tunes the threshold for the combined_score column according to how many unique genes would be left. |
| 11_v1_GraphTab_sparse_1.ipynb | Uses the dataset from 07_v2_graph_dataset.ipynb, where each cell-line graph has the topology Data(x=[458, 4], edge_index=[2, 4760]). |
| 11_v1_GraphTab_nonsparse.ipynb | Uses the dataset from 07_v1_2_get_linking_dataset.ipynb, where each cell-line graph has the topology Data(x=[858, 4], edge_index=[2, 83126]). This took too long per epoch, so other approaches were needed to reduce the number of edges in the graph (see the combined_score approach). |
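
The filtering order described for 07_v3_graph_dataset.ipynb (landmark genes first, then the combined_score threshold) can be sketched in plain Python. This is only an illustration under assumptions, not the notebook's code: the gene names G1 to G5, their scores, and the function names filter_edges, unique_genes, and genes_left_per_threshold are all made up, and the strict "greater than" comparison follows the "more than 950" wording above.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (gene_a, gene_b, combined_score); toy values, not taken from STRING.
Edge = Tuple[str, str, int]


def filter_edges(edges: Iterable[Edge], landmark_genes: Set[str], threshold: int) -> List[Edge]:
    """Keep edges between landmark genes, then drop those at or below the threshold."""
    landmark_only = [e for e in edges if e[0] in landmark_genes and e[1] in landmark_genes]
    return [e for e in landmark_only if e[2] > threshold]


def unique_genes(edges: Iterable[Edge]) -> Set[str]:
    """Collect every gene that still appears in at least one edge."""
    return {gene for a, b, _ in edges for gene in (a, b)}


def genes_left_per_threshold(
    edges: List[Edge], landmark_genes: Set[str], thresholds: Iterable[int]
) -> Dict[int, int]:
    """Count the remaining unique genes for each candidate threshold."""
    return {t: len(unique_genes(filter_edges(edges, landmark_genes, t))) for t in thresholds}


toy_edges: List[Edge] = [
    ("G1", "G2", 990),
    ("G2", "G3", 960),
    ("G3", "G4", 700),
    ("G1", "G5", 999),  # G5 is not a landmark gene, so this edge is dropped first
]
toy_landmarks = {"G1", "G2", "G3", "G4"}

print(genes_left_per_threshold(toy_edges, toy_landmarks, [600, 950, 980]))
```

Scanning several thresholds this way shows how quickly genes drop out of the graph as the cut-off rises, which is the trade-off between graph size and runtime noted for the sparse and non-sparse GraphTab notebooks.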

📆 Todos

  • fix error with mutations dataset
  • include mutation features in tensor
  • start building the bi-modal network structure and build a simple NN
  • correct DataLoader to access cell-line-gene-drug-ic50 tuple correctly
  • shuffle
  • Run TabGraph with an appropriate GNN layer type (see 12_v1_TabTab.ipynb)
  • For GraphTab and GraphGraph, use the combined_score column to sparsify the connections between the genes with an appropriate threshold, e.g. 0.95*1_000 (see 07_v2_graph_dataset.ipynb)
    • problem: the number of genes decreased from 858 to 458
    • Filter first by landmark genes and then by the combined_score (and not the other way around as done in 07_v2_graph_dataset.ipynb) (see TODO)
  • Also save the other performance metrics per epoch (r, r2, mae, rmse)
  • Parallelize code using num_workers > 0
    • once all models are running, convert to .py files instead of notebooks
  • Convert to non-notebook .py code
    • include setting of args from terminal
  • Log the outputs to a different file in the performances folder instead of printing
  • track and save run-time per epoch in the performance output
  • Include combined_score threshold in the args & in the processor
  • Include gdsc database filter in the args & in the processor
    • for now working for GDSC2
  • (optional for now) include GDSC1 data as well; check the shift in the ln(IC50) values and think about a strategy to meaningfully combine both in a single dataset (see TODO)
    • Combine both GDSC1 and GDSC2 in a complete dataset to increase the training data (see TODO)
  • run GNNExplainer on the graph branches of the bi-modal networks
  • Include dropout parameter from args to the networks
  • Include args to perform multiple experiments for different seeds per model
  • Run Graph-Tab approach with GAT instead of GCN
    • model is not overfitting anymore; also removed batch normalization between the GAT layers and replaced global_mean_pool with global_max_pool
  • Print NN architecture in each log-file.
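
The per-epoch metrics named in the list above (r, r2, mae, rmse) could be computed along the following lines. This is a minimal standard-library sketch, not the repository's implementation: it assumes r means the Pearson correlation and r2 the coefficient of determination, the function name regression_metrics is hypothetical, and it expects non-constant inputs so that the denominators are non-zero.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute Pearson r, R^2, MAE, and RMSE for one epoch's predictions."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)                      # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)       # total sum of squares
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))

    return {
        "r": cov / math.sqrt(ss_tot * ss_pred),
        "r2": 1.0 - ss_res / ss_tot,
        "mae": mae,
        "rmse": math.sqrt(ss_res / n),
    }


# Toy targets and predictions; in practice these would be ln(IC50) values.
print(regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]))
```

Appending the returned dictionary to a list once per epoch would give a simple record that can later be written to the performance log.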

Networks:

👀 Questions

  • How to use the given RMSE in the drug-response matrix?
  • How can I improve the performance of the cell-GNN models? More genes (but then the runtime becomes too long)?
  • Why is the cell-line graph approach overfitting so much?
