Unsupervised Machine Learning Technique - KMeans Clustering to classify cryptocurrency data, using Principal Component Analysis (PCA) to reduce the number of dimensions of the scaled data.
The purpose of this project is to use unsupervised machine learning techniques to analyze cryptocurrency data. The original cryptocurrency data from CryptoCompare is preprocessed using Pandas to fit unsupervised machine learning models. A clustering algorithm is used to group the data, and hvPlot visualizations are used to create a report that covers the cryptocurrencies currently on the trading market and how they could be grouped to create a classification system for a new investment.
- Data Source: crypto_data.csv
- Software: Python 3.8.8, Pandas, Jupyter Notebook 6.4.6, Anaconda Navigator 2.1.1, imbalanced-learn, scikit-learn
Preprocessing the dataset: Pandas was used to reduce the dataset of 1,252 cryptocurrencies to the 532 that could be used for machine learning.
- Remove inactive cryptocurrencies and cryptocurrencies that don't have an algorithm
- Remove the trading status column, cryptocurrencies with incomplete data, and any cryptocurrencies that haven't been mined
- Extract the coin names and hold them separately
- Use the get_dummies method to turn each algorithm into its own feature
- Standardize the data with StandardScaler()
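The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual notebook: the tiny in-memory table stands in for crypto_data.csv, and the column names (`CoinName`, `Algorithm`, `IsTrading`, `TotalCoinsMined`), ticker labels, and values are all assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for crypto_data.csv; column names and values are assumptions.
crypto_df = pd.DataFrame(
    {
        "CoinName": ["AlphaCoin", "BetaCoin", "GammaCoin",
                     "DeltaCoin", "EpsilonCoin", "ZetaCoin"],
        "Algorithm": ["SHA-256", "Scrypt", None, "SHA-256", "X11", "Scrypt"],
        "IsTrading": [True, True, True, False, True, True],
        "TotalCoinsMined": [100.0, 2500.0, 40.0, 900.0, 0.0, 7000.0],
    },
    index=["ALP", "BET", "GAM", "DEL", "EPS", "ZET"],
)

# Keep only actively traded coins, then drop the trading-status column.
crypto_df = crypto_df[crypto_df["IsTrading"]].drop(columns="IsTrading")

# Drop rows with missing values (for example, a coin with no algorithm).
crypto_df = crypto_df.dropna()

# Keep only coins that have actually been mined.
crypto_df = crypto_df[crypto_df["TotalCoinsMined"] > 0]

# Hold the coin names separately, sharing the same index as the features.
coin_names = crypto_df[["CoinName"]]
features = crypto_df.drop(columns="CoinName")

# One-hot encode the algorithm so each algorithm becomes its own feature.
X = pd.get_dummies(features, columns=["Algorithm"], dtype=float)

# Standardize every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

print(list(X.columns))
print(X_scaled.shape)
```

Keeping `coin_names` on the same index as the features makes it easy to join the names back onto the clustered results later.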
Use Principal Component Analysis (PCA) from scikit-learn to reduce the scaled data to three principal components, which keeps the most meaningful components and allows three-dimensional modeling.
PCA is a statistical technique to speed up machine learning algorithms when the number of input features (or dimensions) is too high. PCA reduces the number of dimensions by transforming a large set of variables into a smaller one that contains most of the information in the original large set.
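As a rough illustration of the reduction to three components, the sketch below applies scikit-learn's `PCA` to synthetic data. The random matrix stands in for the scaled cryptocurrency features, and the column names `PC 1` to `PC 3` are hypothetical labels chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled feature matrix (50 rows, 8 features).
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(50, 8))

# Project the data onto its first three principal components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Wrap the result in a DataFrame; the column names are hypothetical.
pcs_df = pd.DataFrame(X_pca, columns=["PC 1", "PC 2", "PC 3"])

print(pcs_df.shape)
# Fraction of the original variance retained by the three components.
print(pca.explained_variance_ratio_.sum())
```

Checking `explained_variance_ratio_` is one way to judge how much information the three components actually retain.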
The objective of K-means is to group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset. Each cluster is then defined by the mean of all the points that belong to it.
The K-means algorithm groups the data into K clusters, where belonging to a cluster is based on some similarity or distance measure to a centroid. A centroid is the arithmetic mean position of all the points in a cluster; it does not have to coincide with an actual data point. In two dimensions, the centroid is found by taking the mean of all the x values in a cluster and the mean of all the y values in a cluster; with more features, the same is done for every dimension.
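To make the assignment and centroid steps concrete, here is a small pure-Python sketch of the K-means loop on six made-up 2-D points. The helper names (`squared_distance`, `assign_labels`, `update_centroids`) and the starting centroids are illustrative assumptions; in practice a library implementation such as scikit-learn's `KMeans` would be used.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_labels(points, centroids):
    """Give each point the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def update_centroids(points, labels, centroids):
    """Move each centroid to the per-dimension mean of its members."""
    updated = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            # An empty cluster keeps its previous centroid.
            updated.append(old)
            continue
        updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return updated


# Six made-up points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # arbitrary starting guesses

labels = None
while True:
    new_labels = assign_labels(points, centroids)
    if new_labels == labels:
        break  # assignments stopped changing, so the loop has converged
    labels = new_labels
    centroids = update_centroids(points, labels, centroids)

print(labels)
print(centroids)
```

Each centroid ends up at the mean x and mean y of its members, which is exactly the definition given above.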
The best k value appears to be 4, so we used 4 clusters to categorize the cryptocurrencies.
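One common way to arrive at a value like this is to compare the inertia (within-cluster sum of squared distances) for a range of k values and look for the "elbow" where the improvement levels off. The text here does not state exactly how k was chosen, so the sketch below is an assumption about the approach, run on synthetic blobs instead of the project data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups in three dimensions.
centers = [[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

# Fit K-means for k = 1..10 and record the inertia of each fit.
inertia = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    inertia[k] = model.inertia_

# The elbow is where adding another cluster stops reducing inertia much.
for k, value in inertia.items():
    print(k, round(value, 1))
```

On this synthetic data the inertia falls sharply up to k = 4 and only slightly afterwards, which is the pattern an elbow curve is meant to reveal.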
The column "Class" shows the cluster label assigned to each cryptocurrency.
For this, we first create a new DataFrame that holds the scaled data and uses the index of the clustered_df DataFrame.
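A minimal sketch of these two steps, under stated assumptions: the data is synthetic, the ticker index and feature column names are hypothetical, and `StandardScaler` is used only as an example because the scaler for this step is not named here. Only `clustered_df` and the "Class" column come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; tickers and feature names are hypothetical.
rng = np.random.default_rng(1)
tickers = [f"COIN{i}" for i in range(40)]
feature_cols = ["PC 1", "PC 2", "PC 3"]
clustered_df = pd.DataFrame(
    rng.normal(size=(40, 3)), index=tickers, columns=feature_cols
)

# Fit K-means with four clusters and store the labels in a "Class" column.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
clustered_df["Class"] = model.fit_predict(clustered_df[feature_cols])

# Build a new DataFrame of scaled data that reuses clustered_df's index,
# so rows in both frames refer to the same cryptocurrency.
scaled_values = StandardScaler().fit_transform(clustered_df[feature_cols])
scaled_df = pd.DataFrame(
    scaled_values, columns=feature_cols, index=clustered_df.index
)

print(clustered_df["Class"].value_counts().sort_index())
print(scaled_df.head())
```

Sharing the index means the "Class" labels can be joined onto the scaled values without any risk of misaligned rows.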
Classifying the 532 cryptocurrencies by the similarity of their features divided them into four classes. The 3D scatter plot shows that classes 3 and 1 contain most of the cryptocurrencies, which are similar in features, while classes 0 and 2 contain cryptocurrencies that are outliers. Investment banks should examine the individual characteristics of these classes to determine their performance and decide on the profitability of investing in them.