More than half a million Canadians live with Alzheimer's disease, and the senior population is growing quickly. The Canadian healthcare system is in dire need of better ways to diagnose Alzheimer's disease and to treat patients earlier.
Current diagnostic methods rely on clinical diagnosis, which cannot fully differentiate between types of dementia that differ at the molecular level rather than in clinical presentation. Inaccurate diagnoses are problematic because they often lead to treatments that have, at best, no effect on the progression of Alzheimer's. To address this problem, researchers are using big data methodologies to identify chemicals in cerebrospinal fluid (CSF) and blood that can act as accurate biomarkers.
These methodologies involve three steps:
- Sampling: patients with different conditions are sampled
- Metabolomics: high-throughput instruments such as LC-MS are used to detect small molecules and measure their concentrations
- Statistical analysis: principal component analysis or factor analysis is used to find significant biomarkers.
Recently, non-negative matrix factorization (NMF) has emerged as a viable alternative to principal component analysis (PCA). NMF creates a collection of k meta-metabolites that can significantly differentiate between diagnoses. Unlike PCA components, meta-metabolites are interpretable and robust (Metagene projection).
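To make the idea of meta-metabolites concrete, here is a minimal sketch of NMF in Python. It assumes a non-negative samples-by-metabolites matrix `V` and uses the standard multiplicative update rules for the Frobenius-norm objective. The synthetic data, the `nmf` helper and the choice of k = 3 are illustrative assumptions, not the project's actual code or data.

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (samples x metabolites) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic stand-in data: 60 samples x 12 metabolites built from 3 hidden patterns
rng = np.random.default_rng(42)
V = rng.random((60, 3)) @ rng.random((3, 12))

W, H = nmf(V, k=3)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Each row of H is one meta-metabolite; its largest weights point to the
# metabolites that define it, which is what makes the components readable
top_per_component = np.argsort(H, axis=1)[:, ::-1][:, :4]

print(f"relative reconstruction error: {rel_error:.4f}")
print(top_per_component)
```

Because every weight in `H` is non-negative, each meta-metabolite is a purely additive mix of metabolites, whereas PCA loadings can carry mixed signs that are harder to interpret.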
The objective of this project is to determine the efficacy of NMF in identifying key biomarkers for Alzheimer's disease. Efficacy is tested with machine learning algorithms: a higher test accuracy indicates that a given dimensionality reduction technique successfully differentiates between diagnoses.
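The evaluation idea can be sketched as follows: fit each dimensionality reduction technique on the training data only, apply it to the test data, and compare the test accuracy of the same classifier on both representations. This sketch assumes scikit-learn and a synthetic three-class dataset; the classifier, its settings and the data are placeholders rather than the project's own. Note that scikit-learn's `NMF.transform` is used here for brevity instead of a pseudo-inverse projection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import NMF, PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a measurement table with three diagnostic classes
X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# NMF needs non-negative input: shift by the training minimum, clip the test set at 0
shift = X_train.min(axis=0)
X_train = X_train - shift
X_test = np.clip(X_test - shift, 0, None)

reducers = {
    "PCA": PCA(n_components=5, random_state=0),
    "NMF": NMF(n_components=5, init="nndsvda", max_iter=1000, random_state=0),
}

scores = {}
for name, reducer in reducers.items():
    Z_train = reducer.fit_transform(X_train)  # components learned from training data only
    Z_test = reducer.transform(X_test)        # test data mapped onto the same components
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(Z_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(Z_test))

print(scores)
```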
All data was pulled from the open-source Alzheimer's disease database, ADNI. Specifically, the dataset included 566 patients (AD, healthy and mild cognitive impairment, or MCI) and 148 proteins. The following procedures were conducted:
- All negative values were replaced with 0, since NMF requires non-negative input
- Any rows with missing data were removed
- Extremely high outliers were capped at a ceiling value
- Data was divided into 70% training and 30% testing
- NMF:
  - Created H and W matrices from the training set with k = 5
  - Converted the test set into the same latent space as the training set using pseudo-inverses
- PCA:
  - Constructed components that maximize variance on the training data
  - Projected the test set onto the same components
- Ranked metabolites in each NMF component and extracted the top 20 unique metabolites
- Used ANOVA to find metabolites that differ significantly among the three groups
- Used Tukey analysis on the metabolites flagged by ANOVA to determine which groups differ significantly from each other
- Visualized significantly different metabolites using violin plots, boxplots and correlation heatmaps
- Trained random forest, extreme gradient boosting, k-nearest neighbors and support vector machine models and evaluated them on the test set
  - Tuned hyperparameters with randomized grid search, up-sampling under-represented classes
- Collected the predictions of all models on the training and test sets and used them to train an extreme gradient boosting classifier as an ensemble model
- Created permutation importance plots to determine the most significant features
- Created partial dependence plots for a single important feature to determine how that feature affects the model's predictions
- Used Shapley values from game theory to explain how the model predicts on a single example
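The cleaning steps and the NMF projection from the list above can be sketched as follows. This is a minimal sketch on synthetic data, assuming NumPy and scikit-learn. The 99th-percentile ceiling, the injected defects and the final clipping of projected scores are assumptions made for illustration, since the write-up does not state those details.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples x 15 measurements with a few defects injected
X = rng.lognormal(mean=0.0, sigma=0.5, size=(200, 15))
X[rng.integers(0, 200, 10), rng.integers(0, 15, 10)] = -1.0   # negative readings
X[rng.integers(0, 200, 5), rng.integers(0, 15, 5)] = np.nan   # missing readings
X[rng.integers(0, 200, 5), rng.integers(0, 15, 5)] = 1e3      # extreme outliers
y = rng.integers(0, 3, 200)                                    # placeholder labels

# 1. Replace negative values with 0 (NaN entries are left for the next step)
X = np.where(X < 0, 0.0, X)

# 2. Drop rows that contain any missing value
keep = ~np.isnan(X).any(axis=1)
X, y = X[keep], y[keep]

# 3. Cap extremely high values at a per-column ceiling (99th percentile: an assumption)
ceiling = np.percentile(X, 99, axis=0)
X = np.minimum(X, ceiling)

# 4. Split 70% training / 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 5. Factor the training set: X_train is approximated by W_train @ H, with k = 5
model = NMF(n_components=5, init="nndsvda", max_iter=1000, random_state=0)
W_train = model.fit_transform(X_train)
H = model.components_

# 6. Map the test set into the same space with the Moore-Penrose pseudo-inverse of H
W_test = X_test @ np.linalg.pinv(H)
W_test = np.clip(W_test, 0, None)  # pseudo-inverse scores can dip below 0; clipping is an assumption

print(W_train.shape, W_test.shape)
```

The pseudo-inverse gives the least-squares solution for the test scores without refitting `H`, so the test set never influences the learned components.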
I had three major results from this project:
- Identified new biomarkers whose concentration levels differ significantly between normal, MCI and Alzheimer's patients under ANOVA and Tukey analysis
- Created highly accurate classification models, with F1 scores greater than 0.98
- Using interpretability plots, I determined how each biomarker affects the probability of an AD, MCI or healthy diagnosis, which would help doctors understand my models.
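As an illustration of one of these interpretability tools, the sketch below computes permutation importance with scikit-learn on a synthetic three-class problem. The dataset, the random forest and its settings are stand-ins chosen for the example, not the project's models. Partial dependence and Shapley values are not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the three informative features in columns 0-2
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and record the drop in accuracy
result = permutation_importance(clf, X_test, y_test, n_repeats=20, random_state=0)

# Features ordered from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

A feature whose shuffling barely changes the score contributes little to the model's predictions, which is the reading a clinician would take from the corresponding plot.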
This project earned an honorable mention from the American Statistical Association for its creative and unique applications of statistics.