This repo contains Python implementations for clustering spike protein sequences into recognized variants, aiding the study of the spread and dynamics of rapidly emerging variants.
The spike sequences were retrieved from a publicly available dataset of SARS-CoV-2 spike sequences and stored in a NumPy binary file (`filtered_protein_seq.npy`) for efficient storage and fast read times.
To begin with, a feature vector is generated for each spike sequence using k-mers (with k = 3). A feature selection method (Boruta, LASSO, or Random Fourier Features) can then be applied to reduce dimensionality. Finally, both hard and soft clustering methods are used on these feature vectors, achieving high $F_1$ scores.
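As a rough illustration of the featurization step, here is a minimal sketch of counting overlapping 3-mers per sequence. The 21-symbol alphabet, the use of `allow_pickle`, and the output file name are assumptions; the repo's actual logic lives in `featureVec_Generator.py`.

```python
from itertools import product

import numpy as np

# Hypothetical sketch of k-mer featurization (the repo's own logic is in featureVec_Generator.py).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY*"  # 21 symbols -> 21^3 = 9261 possible 3-mers (assumed alphabet)
K = 3
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=K))}

def kmer_vector(seq: str) -> np.ndarray:
    """Count overlapping k-mers of a spike sequence into a fixed-length vector."""
    vec = np.zeros(len(KMER_INDEX), dtype=np.float32)
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:  # skip k-mers containing unexpected characters
            vec[idx] += 1
    return vec

# allow_pickle=True is an assumption about how the string sequences were saved
sequences = np.load("filtered_protein_seq.npy", allow_pickle=True)
features = np.stack([kmer_vector(str(s)) for s in sequences])
np.save("feature_vectors.npy", features)  # hypothetical output file name
```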
First, clone this repo to your local machine. Then, create a new conda environment with its dependencies by executing the following command in a terminal.
Note: This repo is accelerated with the Intel(R) Extension for Scikit-learn, so make sure to clone and run it on an Intel-powered local machine.
conda env create --file requirements.yml
After the new conda environment has been created, activate it with
conda activate sars-cov-2_clustering
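The accelerated scikit-learn code path mentioned in the note above is normally enabled by patching scikit-learn before any estimators are imported. Whether the repo's scripts already call `patch_sklearn()` is an assumption; the snippet below only shows the standard pattern.

```python
# Enable the Intel(R) Extension for Scikit-learn.
# The patch must be applied before importing any sklearn estimators.
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.cluster import KMeans  # now dispatched to the accelerated implementation
```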
The Fuzzy C-Means clustering method is not yet available through the conda package manager. While in the activated `sars-cov-2_clustering` environment, install the `fuzzy-c-means` package with pip:
pip install fuzzy-c-means
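For reference, the `fuzzy-c-means` package exposes an `FCM` estimator with a scikit-learn-like interface. A minimal sketch (the cluster count and the feature-vector file name are assumptions) looks like this:

```python
import numpy as np
from fcmeans import FCM

X = np.load("feature_vectors.npy")        # hypothetical feature-vector file
fcm = FCM(n_clusters=5, random_state=42)  # cluster count is an assumption
fcm.fit(X)

hard_labels = fcm.predict(X)  # hard assignment per sequence
memberships = fcm.u           # soft (fuzzy) membership matrix
```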
First, as stated in "What is this repo about?", a feature vector must be generated for each spike sequence.
Ensure that your current working directory is set to the cloned repo.
python featureVec_Generator.py
After generating and saving the feature vectors in NumPy binary files, clustering can be performed using the methods in each respective folder.
Each clustering method folder includes the following subfolders (a code sketch of these feature-selection steps follows the list):
- `org/`
  - Clustering without feature selection applied to the feature vectors.
  - Results in a longer runtime and a lower $F_1$ score due to high dimensionality.
- `Boruta/`
  - Clustering after applying Boruta.
  - Boruta is a supervised method built around the random forest (RF) classification algorithm.
  - It works by creating shadow features, so that features do not compete among themselves but instead compete with randomized versions of themselves.
  - It then extracts the importance of each feature (with respect to the class label) and keeps only the features whose importance exceeds a specific threshold.
- `Lasso/`
  - Clustering after applying Least Absolute Shrinkage and Selection Operator (LASSO) regression.
  - LASSO is a special case of penalized least squares regression with an $L_1$ penalty function.
  - By combining the good qualities of ridge regression and subset selection, LASSO can improve both model interpretability and prediction accuracy.
- `RFT/`
  - Clustering after applying a kernel-based method called Random Fourier Features (RFT).
  - RFT is an unsupervised approach that maps the input data to a randomized low-dimensional feature space (a Euclidean inner product space) to obtain an approximate representation of the data in a lower dimension $D$ than the original dimension $d$.
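Below is a minimal sketch of how these three feature-selection steps might look in code; the parameter values, file names, and label handling are assumptions (Boruta and LASSO are supervised and need variant labels, while the random Fourier mapping is unsupervised), and the repo's scripts may differ.

```python
import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Lasso

X = np.load("feature_vectors.npy")  # hypothetical feature-vector file
y = np.load("variant_labels.npy")   # hypothetical integer-encoded variant labels

# Boruta: shadow features + random forest importances (supervised).
rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
boruta = BorutaPy(rf, n_estimators="auto", random_state=42)
boruta.fit(X, y)
X_boruta = X[:, boruta.support_]

# LASSO: keep features with non-zero coefficients under an L1 penalty (supervised).
lasso = SelectFromModel(Lasso(alpha=0.01))  # alpha is an assumption
X_lasso = lasso.fit_transform(X, y)

# Random Fourier features: unsupervised map from d to D dimensions.
rff = RBFSampler(n_components=500, random_state=42)  # D = 500 is an assumption
X_rff = rff.fit_transform(X)
```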
To perform clustering on feature vectors, run:
python clusteringMethod_featureSelection.py
Here, replace `clusteringMethod` with `kmeans`, `kmodes`, or `fuzzy`, and `featureSelection` with `org`, `Boruta`, `Lasso`, or `RFT`.
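For orientation, hard clustering with K-Means and K-Modes typically looks like the sketch below; the cluster count, input file, and `KModes` initialization are assumptions, and the repo's scripts may differ.

```python
import numpy as np
from kmodes.kmodes import KModes
from sklearn.cluster import KMeans

X = np.load("feature_vectors.npy")  # hypothetical feature-vector file

# K-Means hard clustering (cluster count is an assumption).
kmeans_labels = KMeans(n_clusters=5, random_state=42).fit_predict(X)

# K-Modes hard clustering on the same feature vectors.
kmodes = KModes(n_clusters=5, init="Huang", n_init=5, random_state=42)
kmodes_labels = kmodes.fit_predict(X)
```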
Next, create a contingency table by executing:
python new_cnt.py
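A contingency table simply cross-tabulates the predicted cluster assignments against the known variant labels. A minimal sketch with pandas (the file names are assumptions, and `new_cnt.py` may format its output differently) could be:

```python
import numpy as np
import pandas as pd

clusters = np.load("cluster_labels.npy")  # hypothetical predicted cluster IDs
variants = np.load("variant_labels.npy")  # hypothetical true variant labels

# Rows: clusters, columns: variants, cells: number of sequences in each pair.
contingency = pd.crosstab(clusters, variants, rownames=["cluster"], colnames=["variant"])
print(contingency)
```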
Finally, calculate the $F_1$ score by running:
python Calculate_F1.py
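One common way to score a clustering with $F_1$ is to map each cluster to its majority variant and then compute a weighted $F_1$ over the resulting predictions; treating this as what `Calculate_F1.py` does is an assumption, as are the file names below.

```python
import numpy as np
from sklearn.metrics import f1_score

clusters = np.load("cluster_labels.npy")  # hypothetical predicted cluster IDs
variants = np.load("variant_labels.npy")  # hypothetical true variant labels

# Map every cluster to the variant that occurs most often inside it.
predicted = np.empty_like(variants)
for c in np.unique(clusters):
    mask = clusters == c
    values, counts = np.unique(variants[mask], return_counts=True)
    predicted[mask] = values[np.argmax(counts)]

print("Weighted F1:", f1_score(variants, predicted, average="weighted"))
```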
The following are the results from running the repo on a 2020 MacBook Pro with an Intel(R) Core(TM) i7 processor. There are 62,657 observations with 9,261 features (from k-mers with k = 3).
Results for K-Modes and Fuzzy C-Means clustering on original feature vectors (without feature selection) are not available due to predicted runtimes exceeding a day.
The plots show high $F_1$ scores for most combinations of clustering and feature selection methods.
The runtime comparison shows K-Modes as the slowest clustering method. K-Means with the LASSO feature selection method is the fastest, but it does not achieve the highest $F_1$ score.
You can experiment with different methods and with even larger spike protein sequence datasets, perhaps also including newer variants such as Omicron.