Identifying sets of rare diseases with shared aspects of etiology and pathophysiology may enable drug repurposing and/or platform-based therapeutic development. Toward that aim, we use an integrative knowledge-graph-based approach to construct clusters of rare diseases.
Note: The workflow is designed and executed within an HPC Slurm cluster environment. For more information, please see the example notebooks provided.
The steps to reproduce the workflow are outlined below:
- Set up the environment and directories:
bash 00_setup_data.sh
conda env create -f rdclust.yml
conda activate rdclust
pip install -r requirements.txt
- Get data:
01_get_public_data.sh
Note: The GARD data is currently NOT publicly accessible via an API; therefore, we provide the necessary datasets in this repository (RD-Clust/data/raw/). When a public API becomes available, the workflow and 01_get_gard_data.sh will be updated.
- Pre-process the data:
02_process_ontologies.sh
- Generate random walks:
03_walks_array.sh
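To illustrate what the walk-generation step does conceptually, here is a minimal sketch of uniform random walks over a toy knowledge graph. The node names, walk length, and walks-per-node values are invented for illustration; the production walks are generated by 03_walks_array.sh over the full graph.

```python
# Hypothetical sketch of uniform random walks over a toy knowledge graph.
# Node names, walk length, and walk counts here are illustrative assumptions.
import random

def random_walks(graph, walks_per_node=2, walk_length=5, seed=42):
    """Generate fixed-length uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph.get(walk[-1], [])
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy disease-gene-phenotype graph as adjacency lists.
toy_graph = {
    "disease:A": ["gene:1", "phenotype:x"],
    "gene:1": ["disease:A", "disease:B"],
    "disease:B": ["gene:1"],
    "phenotype:x": ["disease:A"],
}
walks = random_walks(toy_graph)
```

The walks then serve as "sentences" of graph nodes for the embedding step that follows.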
- Generate node embeddings:
04_embeddings_array.sh
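The embedding step presumably trains learned vectors over the walks (e.g., skip-gram style). As a pure-Python stand-in that only illustrates the underlying idea, the sketch below represents each node by its window-based co-occurrence counts across walks; the actual embeddings come from 04_embeddings_array.sh.

```python
# Stand-in for walk-based node embeddings: each node is represented by its
# window co-occurrence counts over the walks. The real pipeline likely
# trains learned embeddings; this sketch only demonstrates the principle.
from collections import Counter, defaultdict

def cooccurrence_vectors(walks, window=2):
    vocab = sorted({node for walk in walks for node in walk})
    counts = defaultdict(Counter)
    for walk in walks:
        for i, node in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if j != i:
                    counts[node][walk[j]] += 1
    # Dense vectors aligned to a fixed vocabulary order.
    return {node: [counts[node][v] for v in vocab] for node in vocab}, vocab

walks = [["disease:A", "gene:1", "disease:B"],
         ["disease:B", "gene:1", "disease:A"]]
vectors, vocab = cooccurrence_vectors(walks)
```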
- Create clustering models:
05_cluster_array.sh
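The clustering models themselves come from 05_cluster_array.sh; as a toy illustration of clustering embedding vectors, here is a plain k-means sketch over invented two-dimensional "embeddings" (the points, k, and iteration count are assumptions for demonstration only).

```python
# Toy k-means over invented embedding vectors; illustrative only.
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign points to the nearest center, recompute means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Nearest center by squared Euclidean distance.
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties out
                centers[i] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, clusters

# Two well-separated toy "embedding" groups.
points = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.2, 5.2)]
centers, clusters = kmeans(points, k=2)
```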
- Post-hoc summaries:
- Gene enrichment:
06_calculate_enrichment.sh
- Clustering metrics:
06_summarize_clusters.sh
- Walk annotation counts:
06_summarize_walks.sh
- Semantic similarity:
06_calculate_semantic_similarity.sh
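Gene-set over-representation for a cluster is commonly assessed with a hypergeometric test; the sketch below shows that calculation in miniature. The real statistics come from 06_calculate_enrichment.sh, and every count below is invented for illustration.

```python
# Hypergeometric over-representation test for a cluster's gene set.
# All counts here are made-up examples, not pipeline outputs.
from math import comb

def hypergeom_pvalue(overlap, cluster_genes, pathway_genes, universe):
    """P(X >= overlap) when drawing cluster_genes genes from a universe
    of which pathway_genes carry the annotation of interest."""
    total = comb(universe, cluster_genes)
    p = 0.0
    for k in range(overlap, min(cluster_genes, pathway_genes) + 1):
        p += comb(pathway_genes, k) * comb(universe - pathway_genes,
                                           cluster_genes - k) / total
    return p

# e.g. 8 of a cluster's 20 genes fall in a 100-gene pathway,
# against a universe of 20,000 genes: a strong enrichment.
p = hypergeom_pvalue(overlap=8, cluster_genes=20, pathway_genes=100, universe=20000)
```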
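Many ontology-based semantic similarity measures exist (e.g., Resnik, Lin); 06_calculate_semantic_similarity.sh may use a different one. As the simplest illustration, the sketch below computes the Jaccard index over each term's ancestor set in a tiny invented is-a hierarchy.

```python
# Jaccard similarity over ontology ancestor sets; the hierarchy below is
# invented for illustration and is not part of the actual pipeline.
def ancestors(term, parents):
    """All ancestors of a term (including itself) under an is-a hierarchy."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def jaccard_similarity(a, b, parents):
    sa, sb = ancestors(a, parents), ancestors(b, parents)
    return len(sa & sb) / len(sa | sb)

# Toy is-a hierarchy: child -> list of parents.
parents = {
    "ciliopathy_A": ["ciliopathy"],
    "ciliopathy_B": ["ciliopathy"],
    "ciliopathy": ["rare_disease"],
    "metabolic_X": ["rare_disease"],
}
sim = jaccard_similarity("ciliopathy_A", "ciliopathy_B", parents)
```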
Detailed analyses and visualizations are provided in the notebooks directory.
- As a quality check, we randomized the graphs to assess how well disease nodes cluster when their relationships are not based on real knowledge. See the QC directory for the quality-control pipeline.
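One common way to randomize a graph while keeping each node's degree, as this kind of QC typically requires, is the double-edge swap; the sketch below shows the idea on an invented edge list (swap count and edges are illustrative; the actual randomization lives in the QC pipeline).

```python
# Degree-preserving graph randomization via double-edge swaps.
# The toy edge list and swap count are illustrative assumptions.
import random

def double_edge_swap(edges, n_swaps=100, seed=1):
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    for _ in range(n_swaps):
        (a, b), (c, d) = rng.sample(edges, 2)
        # Rewire (a,b),(c,d) -> (a,d),(c,b) unless that would create a
        # self-loop or a duplicate edge.
        if a == d or c == b:
            continue
        if (a, d) in edges or (c, b) in edges:
            continue
        edges.remove((a, b))
        edges.remove((c, d))
        edges += [(a, d), (c, b)]
    return edges

edges = [("d1", "g1"), ("d2", "g2"), ("d3", "g3"), ("d4", "g4")]
shuffled = double_edge_swap(edges)
```

Because every swap exchanges partners between two edges, each node keeps its degree while the specific disease-gene pairings are scrambled.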