Public repository for SMCDC21 Challenge 2 submission (accepted for presentation at SMC2021).
Motivated by Challenge 2 of the 5th Annual Smoky Mountains Computational Sciences Data Challenge, we analyze the COVID-19 biomedical knowledge graph. After computing geodesic statistics for all nodes in the network, we present several machine learning pipelines for automated hypothesis generation.
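As a rough illustration of what per-node geodesic statistics look like, the sketch below runs breadth-first search from every node of a tiny toy graph and records each node's eccentricity and mean shortest-path distance. It is not the pipeline used in this repo: the toy graph and the names bfs_distances and geodesic_stats are hypothetical, and it uses only the standard library rather than the pickled NetworkX graphs in data/graph.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def geodesic_stats(adj):
    """Per-node eccentricity and mean distance to the other reachable nodes."""
    stats = {}
    for node in adj:
        dist = bfs_distances(adj, node)
        others = [d for n, d in dist.items() if n != node]
        stats[node] = {
            "eccentricity": max(others, default=0),
            "mean_distance": sum(others) / len(others) if others else 0.0,
        }
    return stats


# Hypothetical toy graph: an undirected path a - b - c - d.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
stats = geodesic_stats(toy)
```

Running one BFS per node is only practical for small unweighted graphs, which is presumably why the repo ships pre-computed shortest-path data for the full knowledge graph.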
This repo is organized as follows:
covid19-link-prediction
├── classifiers # saved classification models
├── data
│ ├── betweenness # pre-computed betweenness data
│ ├── embeddings # trained DeepWalk embeddings
│ ├── graph # pickled NetworkX graphs
│ ├── og # dir for provided challenge data
│ ├── other # dir for intermediary data/computations
│ ├── shortest_paths # pre-computed APSP data
│ ├── training # matrices used for training
│ └── validation # matrices used for validation
├── dev code # assorted dev notebooks
├── misc # for assorted supplemental files (currently only the kg_browser image)
├── paper # submission pre-print
└── kg_browser.py # utils for browsing processed data/using models/etc.
- Clone this repo
- Download shortest paths data from here (~6 GB) and relocate it to \smcdc-2021-2\data\shortest_paths (see Repo Organization)
- Download validation data from here (~2/3 GB) and relocate it to \smcdc-2021-2\data\validation (see Repo Organization)
- Download challenge dataset from here and relocate it to \smcdc-2021-2\data\og (see Repo Organization)
- cd into the covid19-link-prediction directory and run the following in your shell:
pip install -r requirements.txt
We provide kg_browser, a convenient utility interface for accessing our processed data and models.
For further examples, see the kg_browser demo notebook here.
Our top 1000 proposed novel relations may be found here. Here are the top 10:
+-----------------------+------------------------------+
| Edge | Estimated Link Probability |
+=======================+==============================+
| C0035236 <-> C1441604 | 0.999488 |
+-----------------------+------------------------------+
| C0027362 <-> C0020967 | 0.999487 |
+-----------------------+------------------------------+
| C0003062 <-> C0012754 | 0.999484 |
+-----------------------+------------------------------+
| C0086418 <-> C0027934 | 0.999484 |
+-----------------------+------------------------------+
| C0006104 <-> C0333230 | 0.999484 |
+-----------------------+------------------------------+
| C1314650 <-> C2700280 | 0.999464 |
+-----------------------+------------------------------+
| C0543467 <-> C0265883 | 0.999450 |
+-----------------------+------------------------------+
| C0582175 <-> C2697883 | 0.999444 |
+-----------------------+------------------------------+
| C1320226 <-> C0401805 | 0.999403 |
+-----------------------+------------------------------+
| C0206031 <-> C0038454 | 0.999380 |
+-----------------------+------------------------------+
This project was created with:
powerlaw==1.4.6
scikit_network==0.23.1
numpy==1.21.1
networkx==2.5
pandas==1.2.4
seaborn==0.11.0
matplotlib==3.3.2
scipy==1.7.0
joblib==0.17.0
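To check whether a local environment matches the pinned versions above, something like the following sketch may help. It assumes Python 3.8+ (for importlib.metadata). The helpers parse_pins and check_pins are hypothetical names and not part of this repo, and distribution-name lookup (for example scikit_network vs. scikit-network) may behave differently across Python versions.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Map each package to (pinned version, installed version or None)."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # not installed, or registered under another name
        report[name] = (pinned, installed)
    return report


# Pins copied from the list above.
PINS = parse_pins("""
powerlaw==1.4.6
scikit_network==0.23.1
numpy==1.21.1
networkx==2.5
pandas==1.2.4
seaborn==0.11.0
matplotlib==3.3.2
scipy==1.7.0
joblib==0.17.0
""")

if __name__ == "__main__":
    for name, (pinned, installed) in check_pins(PINS).items():
        status = "ok" if installed == pinned else "MISMATCH"
        print(f"{name}: pinned {pinned}, installed {installed} [{status}]")
```

A mismatch does not necessarily mean the code will fail; it only flags where the environment differs from the one the project was created with.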
- Lucas Hurley McCabe (email)