Identifying sets of rare diseases with shared aspects of etiology and pathophysiology may enable drug repurposing and/or platform-based therapeutic development. Toward that aim, we use an integrative knowledge-graph-based approach to construct clusters of rare diseases.
Note: The workflow is designed and executed within an HPC Slurm cluster environment. For more information, please see the example notebooks provided.
The steps to reproduce the workflow are outlined below:
- Set up the environment and directories:
bash 00_setup_data.sh
conda env create -f rdclust.yml
conda activate rdclust
pip install -r requirements.txt
- Get data:
01_get_public_data.sh
Note: The GARD data is currently NOT publicly accessible via an API; therefore, we provide the necessary datasets in this repository (RD-Clust/data/raw/). When a public API becomes available, the workflow and 01_get_gard_data.sh will be updated.
- Pre-process the data:
02_process_ontologies.sh
- Generate random walks:
03_walks_array.sh
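To illustrate what the walk-generation step does conceptually, here is a minimal sketch of uniform random walks over a toy knowledge graph. The node names, walk length, and walks-per-node values are invented for illustration; the production walks are generated by 03_walks_array.sh over the full graph.

```python
# Hypothetical sketch of uniform random walks over a toy knowledge graph.
# Node names, walk length, and walk counts here are illustrative assumptions.
import random

def random_walks(graph, walks_per_node=2, walk_length=5, seed=42):
    """Generate fixed-length uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph.get(walk[-1], [])
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy disease-gene-phenotype graph as adjacency lists.
toy_graph = {
    "disease:A": ["gene:1", "phenotype:x"],
    "gene:1": ["disease:A", "disease:B"],
    "disease:B": ["gene:1"],
    "phenotype:x": ["disease:A"],
}
walks = random_walks(toy_graph)
```

The walks then serve as "sentences" of graph nodes for the embedding step that follows.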
- Generate node embeddings:
04_embeddings_array.sh
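The embedding step presumably trains learned vectors over the walks (e.g., skip-gram style). As a pure-Python stand-in that only illustrates the underlying idea, the sketch below represents each node by its window-based co-occurrence counts across walks; the actual embeddings come from 04_embeddings_array.sh.

```python
# Stand-in for walk-based node embeddings: each node is represented by its
# window co-occurrence counts over the walks. The real pipeline likely
# trains learned embeddings; this sketch only demonstrates the principle.
from collections import Counter, defaultdict

def cooccurrence_vectors(walks, window=2):
    vocab = sorted({node for walk in walks for node in walk})
    counts = defaultdict(Counter)
    for walk in walks:
        for i, node in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if j != i:
                    counts[node][walk[j]] += 1
    # Dense vectors aligned to a fixed vocabulary order.
    return {node: [counts[node][v] for v in vocab] for node in vocab}, vocab

walks = [["disease:A", "gene:1", "disease:B"],
         ["disease:B", "gene:1", "disease:A"]]
vectors, vocab = cooccurrence_vectors(walks)
```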
- Create clustering models:
05_cluster_array.sh
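The clustering models themselves come from 05_cluster_array.sh; as a toy illustration of clustering embedding vectors, here is a plain k-means sketch over invented two-dimensional "embeddings" (the points, k, and iteration count are assumptions for demonstration only).

```python
# Toy k-means over invented embedding vectors; illustrative only.
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign points to the nearest center, recompute means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Nearest center by squared Euclidean distance.
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties out
                centers[i] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, clusters

# Two well-separated toy "embedding" groups.
points = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.2, 5.2)]
centers, clusters = kmeans(points, k=2)
```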
- Post-hoc summaries:
- Gene enrichment:
06_calculate_enrichment.sh
- Clustering metrics:
06_summarize_clusters.sh
- Walk annotation counts:
06_summarize_walks.sh
- Semantic similarity:
06_calculate_semantic_similarity.sh
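Gene-set over-representation for a cluster is commonly assessed with a hypergeometric test; the sketch below shows that calculation in miniature. The real statistics come from 06_calculate_enrichment.sh, and every count below is invented for illustration.

```python
# Hypergeometric over-representation test for a cluster's gene set.
# All counts here are made-up examples, not pipeline outputs.
from math import comb

def hypergeom_pvalue(overlap, cluster_genes, pathway_genes, universe):
    """P(X >= overlap) when drawing cluster_genes genes from a universe
    of which pathway_genes carry the annotation of interest."""
    total = comb(universe, cluster_genes)
    p = 0.0
    for k in range(overlap, min(cluster_genes, pathway_genes) + 1):
        p += comb(pathway_genes, k) * comb(universe - pathway_genes,
                                           cluster_genes - k) / total
    return p

# e.g. 8 of a cluster's 20 genes fall in a 100-gene pathway,
# against a universe of 20,000 genes: a strong enrichment.
p = hypergeom_pvalue(overlap=8, cluster_genes=20, pathway_genes=100, universe=20000)
```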
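Many ontology-based semantic similarity measures exist (e.g., Resnik, Lin); 06_calculate_semantic_similarity.sh may use a different one. As the simplest illustration, the sketch below computes the Jaccard index over each term's ancestor set in a tiny invented is-a hierarchy.

```python
# Jaccard similarity over ontology ancestor sets; the hierarchy below is
# invented for illustration and is not part of the actual pipeline.
def ancestors(term, parents):
    """All ancestors of a term (including itself) under an is-a hierarchy."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def jaccard_similarity(a, b, parents):
    sa, sb = ancestors(a, parents), ancestors(b, parents)
    return len(sa & sb) / len(sa | sb)

# Toy is-a hierarchy: child -> list of parents.
parents = {
    "ciliopathy_A": ["ciliopathy"],
    "ciliopathy_B": ["ciliopathy"],
    "ciliopathy": ["rare_disease"],
    "metabolic_X": ["rare_disease"],
}
sim = jaccard_similarity("ciliopathy_A", "ciliopathy_B", parents)
```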
Detailed analyses and visualizations are provided in the notebooks directory.
- As a quality check, we randomized the graphs to assess how well disease nodes cluster when their relationships are not based on real knowledge. See the QC directory for the quality-control pipeline.
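One common way to randomize a graph while keeping each node's degree, as this kind of QC typically requires, is the double-edge swap; the sketch below shows the idea on an invented edge list (swap count and edges are illustrative; the actual randomization lives in the QC pipeline).

```python
# Degree-preserving graph randomization via double-edge swaps.
# The toy edge list and swap count are illustrative assumptions.
import random

def double_edge_swap(edges, n_swaps=100, seed=1):
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    for _ in range(n_swaps):
        (a, b), (c, d) = rng.sample(edges, 2)
        # Rewire (a,b),(c,d) -> (a,d),(c,b) unless that would create a
        # self-loop or a duplicate edge.
        if a == d or c == b:
            continue
        if (a, d) in edges or (c, b) in edges:
            continue
        edges.remove((a, b))
        edges.remove((c, d))
        edges += [(a, d), (c, b)]
    return edges

edges = [("d1", "g1"), ("d2", "g2"), ("d3", "g3"), ("d4", "g4")]
shuffled = double_edge_swap(edges)
```

Because every swap exchanges partners between two edges, each node keeps its degree while the specific disease-gene pairings are scrambled.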