
Unsupervised Machine Learning Technique - KMeans Clustering to classify cryptocurrency data, using Principal Component Analysis (PCA) to reduce the number of dimensions of the scaled data.


ramya-ramamur/Cryptocurrencies


Cryptocurrencies

Overview

The purpose of this project is to use unsupervised machine learning techniques to analyze cryptocurrency data. The original cryptocurrency data from CryptoCompare is preprocessed with Pandas to fit unsupervised machine learning models. A clustering algorithm groups the data, and hvPlot visualizations are used to create a report that shows which cryptocurrencies are currently on the trading market and how they could be grouped into a classification system for a new investment.

Resources

  • Data Source: crypto_data.csv
  • Software: Python 3.8.8, Pandas, Jupyter Notebook 6.4.6, Anaconda Navigator 2.1.1, imbalanced-learn, scikit-learn

Analysis & Results

Preprocessing the data: Pandas was used to reduce the dataset of 1,252 cryptocurrencies to the 532 that could be used for machine learning.

  • Remove inactive cryptocurrencies and cryptocurrencies that don't have an algorithm
  • Remove the trading status column, cryptocurrencies with incomplete data, and any cryptocurrencies that haven't been mined

Screen Shot 2022-03-04 at 12 10 53 PM

  • Extract the coin names and hold them separately

Screen Shot 2022-03-04 at 1 04 43 PM

  • Use the get_dummies method to encode the algorithms as their own features

Screen Shot 2022-03-04 at 1 07 55 PM

  • Standardize the data with StandardScaler()

Screen Shot 2022-03-04 at 1 10 51 PM
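The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the toy rows are made up, and the column names `IsTrading`, `Algorithm`, and `CoinName` are assumptions about how crypto_data.csv is laid out.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for crypto_data.csv. The rows are invented and the column
# names IsTrading, Algorithm, and CoinName are assumptions for illustration.
crypto_df = pd.DataFrame(
    {
        "CoinName": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta", "Eta"],
        "Algorithm": ["Scrypt", "SHA-256", None, "Scrypt", "X11", "SHA-256", "Scrypt"],
        "IsTrading": [True, True, True, False, True, True, True],
        "TotalCoinsMined": [41.0, 0.0, 100.0, 500.0, None, 10.0, 5.0],
        "TotalCoinSupply": [42.0, 1000.0, 200.0, 600.0, 300.0, 21.0, 84.0],
    },
    index=["ALP", "BET", "GAM", "DEL", "EPS", "ZET", "ETA"],
)

# Keep only coins that are trading, then drop the now-constant column.
crypto_df = crypto_df[crypto_df["IsTrading"]].drop(columns="IsTrading")
# Drop rows with any missing value (this also removes coins without an algorithm).
crypto_df = crypto_df.dropna()
# Keep only coins that have actually been mined.
crypto_df = crypto_df[crypto_df["TotalCoinsMined"] > 0]

# Hold the coin names separately; they are labels, not features.
coin_names = crypto_df[["CoinName"]]
features = crypto_df.drop(columns="CoinName")

# One-hot encode the text column so every algorithm becomes its own feature.
X = pd.get_dummies(features, columns=["Algorithm"], dtype=float)

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
```

In this toy data only three coins survive the filters, which mirrors how the real dataset shrinks once inactive, incomplete, and unmined coins are removed.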

Use Principal Component Analysis (PCA) from scikit-learn to reduce the scaled data to three components, keeping only the most meaningful components for three-dimensional modeling.

PCA is a statistical technique to speed up machine learning algorithms when the number of input features (or dimensions) is too high. PCA reduces the number of dimensions by transforming a large set of variables into a smaller one that contains most of the information in the original large set.
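As a rough sketch of this reduction, the snippet below applies scikit-learn's `PCA` to a synthetic matrix that stands in for the scaled cryptocurrency features. The random data and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled feature matrix (50 samples, 10 features).
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(50, 10))

# Project the 10 original dimensions onto 3 principal components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Fraction of the original variance retained by each of the three components.
explained = pca.explained_variance_ratio_
print(X_pca.shape, explained.sum())
```

The sum of `explained_variance_ratio_` indicates how much of the original information the three components keep, which is a useful check before clustering on them.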

PCA

Clustering cryptocurrencies using K-Means (an elbow curve is used to find the best K value, 4, for the K-means method)

The objective of K-means is to group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset. These clusters are then determined by the means of all the points that will belong to the cluster.

The K-means algorithm groups the data into K clusters, where membership in a cluster is based on a similarity or distance measure to a centroid. A centroid is the arithmetic mean position of all the points in a cluster; it does not have to coincide with an actual data point. In two dimensions, the centroid is found by taking the mean of all the x values and the mean of all the y values in the cluster, and the same per-coordinate mean applies in higher dimensions.
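To make the centroid definition concrete, here is a small worked example with made-up two-dimensional points and hand-assigned labels. It computes each centroid as a per-coordinate mean and then checks which centroid each point is closest to, which is the assignment step K-means repeats.

```python
import numpy as np

# Five invented 2D points, already assigned to two clusters by hand.
points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0], [10.0, 10.0], [12.0, 14.0]])
labels = np.array([0, 0, 0, 1, 1])

# Centroid of a cluster = mean of its x values and mean of its y values.
# Cluster 0: ((1+3+5)/3, (2+4+0)/3) = (3, 2); cluster 1: (11, 12).
centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

# Assignment step: each point belongs to the centroid at the smallest
# Euclidean distance.
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
nearest = distances.argmin(axis=1)
```

Here `nearest` matches `labels`, so another K-means iteration would leave these clusters unchanged.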

Elbow Curve

The best k value appears to be 4, so we used 4 clusters to categorize the cryptocurrencies.
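An elbow curve is built by fitting K-means for a range of k values and recording the inertia (the sum of squared distances from each point to its centroid) of each fit. The sketch below uses synthetic blobs from `make_blobs` in place of the PCA output; the data, the range of k, and the variable names are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 3D data with four well-separated groups, standing in for the
# three principal components.
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)

k_values = list(range(1, 11))
inertia = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    # Inertia: sum of squared distances of samples to their closest centroid.
    inertia.append(model.inertia_)

# Pair each k with its inertia; plotting these pairs gives the elbow curve.
elbow_data = list(zip(k_values, inertia))
```

The "elbow" is the k after which inertia stops dropping sharply; adding clusters beyond that point gives only small improvements.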

Elbow Curve

Dataframe after clustering.

The column "Class" shows the cluster labels of the crypto.
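A minimal sketch of how such a "Class" column can be produced, assuming the cluster labels come from `KMeans.fit_predict` on the three principal components. The random data, the coin tickers, and the column names `PC 1` to `PC 3` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical principal components for 20 coins.
rng = np.random.default_rng(1)
pcs_df = pd.DataFrame(
    rng.normal(size=(20, 3)),
    columns=["PC 1", "PC 2", "PC 3"],
    index=[f"COIN{i}" for i in range(20)],
)

# Fit K-means with the k chosen from the elbow curve and predict a label per coin.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
clustered_df = pcs_df.copy()
clustered_df["Class"] = model.fit_predict(pcs_df)
```

Because the labels are stored against the same index as the features, they can later be joined back to the coin names and the original columns.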

df_after_clustering_with_KMeans

Visualizing Cryptocurrencies Results

3D scatter plot with Plotly Express

3d_scatter_plot_clusters

Table with tradable cryptocurrencies using the hvplot.table() function.

hvplot_table_with_tradable_cryptocurrencies

A hvplot scatter plot with x="TotalCoinsMined", y="TotalCoinSupply", and by="Class"

For this we will first make a new DataFrame that has the scaled data with the clustered_df DataFrame index.
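One way this step might look is sketched below. It assumes `MinMaxScaler` is used to bring both columns into the 0 to 1 range, which the text above does not confirm, and the three rows of `clustered_df` are invented.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented stand-in for clustered_df with only the columns needed here.
clustered_df = pd.DataFrame(
    {
        "TotalCoinsMined": [10.0, 50.0, 100.0],
        "TotalCoinSupply": [20.0, 100.0, 420.0],
        "Class": [0, 1, 0],
    },
    index=["AAA", "BBB", "CCC"],
)

cols = ["TotalCoinSupply", "TotalCoinsMined"]
# Assumption: min-max scaling, so each column spans 0 to 1.
scaled = MinMaxScaler().fit_transform(clustered_df[cols])

# Rebuild a DataFrame on the clustered_df index so rows still map to coins.
plot_df = pd.DataFrame(scaled, columns=cols, index=clustered_df.index)
plot_df["Class"] = clustered_df["Class"]
```

With `hvplot.pandas` imported, a DataFrame like `plot_df` could then be passed to `plot_df.hvplot.scatter(x="TotalCoinsMined", y="TotalCoinSupply", by="Class")`.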

Screen Shot 2022-03-04 at 12 32 27 PM

hvplot_scatter_totalcoinsmined_vs_totalcoinsupply

Summary

Classifying the 532 cryptocurrencies by the similarity of their features divided them into four classes. The 3D scatter plot shows that classes 3 and 1 contain most of the cryptocurrencies, which are similar in their features, while classes 0 and 2 contain cryptocurrencies that are outliers. Investment banks should examine the individual characteristics of these classes to determine their performance and decide whether investing in them would be profitable.
