This is a solution notebook for an assignment question from a graduate Data Mining course. Each code block is accompanied by analysis where required.
Dataset link: https://archive.ics.uci.edu/ml/datasets/seeds
Broadly, the following steps have been performed in this solution notebook:
- Minimal preprocessing on the dataset
- Explained limitations of KMeans
- Suggested two existing algorithms (KMedoids and CLARANS) that mitigate some of the limitations of KMeans
- Visualization of given class labels using TSNE
- Ran KMedoids and CLARANS on the seeds dataset and reported the best results obtained on various cluster validity indices
- Compared these results with those of KMeans
- Reported and visualized the hyperparameter tuning for KMedoids and CLARANS required to achieve the best results obtained on the seeds dataset
The assumptions and flow of work above follow the questions asked in the assignment.
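
As a rough illustration of why a medoid-based method can mitigate the outlier sensitivity of KMeans, here is a minimal pure-Python sketch of an alternating k-medoids procedure. It is not the implementation used in this notebook and it does not use the seeds dataset: the function name `k_medoids`, the toy 2-D points, and the initial medoid indices are all assumptions made up for this example.

```python
import math

def k_medoids(points, init_medoids, max_iter=100):
    """Alternating k-medoids sketch (illustrative only).

    points       : list of coordinate tuples
    init_medoids : list of indices into `points` used as starting medoids
    Returns (medoid_indices, labels).
    """
    medoids = list(init_medoids)
    k = len(medoids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest medoid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, points[medoids[c]]))
            for p in points
        ]
        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to all other members of that cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            best = min(
                members,
                key=lambda i: sum(math.dist(points[i], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:  # converged: medoids did not change
            break
        medoids = new_medoids
    return medoids, labels

# Toy data (hypothetical): two tight groups plus one far-away outlier.
points = [
    (1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (1.2, 1.4),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.0), (8.7, 8.4),
    (25.0, 25.0),  # outlier
]

medoids, labels = k_medoids(points, init_medoids=[0, 4])

# Compare the medoid of the outlier's cluster with that cluster's mean,
# which is what a KMeans centroid would be for the same members.
outlier_cluster = labels[-1]
members = [p for p, lab in zip(points, labels) if lab == outlier_cluster]
mean_center = tuple(sum(coord) / len(members) for coord in zip(*members))
medoid_center = points[medoids[outlier_cluster]]

print("medoid:", medoid_center)
print("mean  :", mean_center)
```

In this toy setup the medoid stays on an actual point inside the dense group, while the mean of the same cluster is dragged toward the outlier. CLARANS searches over medoid sets by randomized swaps rather than the exhaustive per-cluster update sketched here, so this block should not be read as a CLARANS example.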