The local machine (Linux) requires Java, Hadoop, and Spark to be installed and configured.
Add the following environment variables if they have not already been configured:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME=/home/{user}/spark-3.1.2-bin-hadoop3.2/
export PYSPARK_PYTHON=python3
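If the Hadoop commands used below are not already on the PATH, HADOOP_HOME may also need to be set. The path below is an assumption based on the Hadoop location used later in this guide; adjust it to the actual installation:
export HADOOP_HOME=/home/{user}/hadoop/hadoop-3.3.1
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin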
Start the local SSH service:
sudo service ssh start
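start-dfs.sh connects to localhost over SSH, so passwordless SSH to localhost should be configured if it is not already. A typical one-time setup (adjust to your environment) is:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys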
Start the NameNode and DataNode daemons:
hadoop/hadoop-3.3.1/sbin/start-dfs.sh
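To verify that the daemons started, jps should list at least the NameNode and DataNode processes:
jps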
Clone the project repository:
https://github.com/smithakolan/NFT-Big-Data-Analysis.git
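For example:
git clone https://github.com/smithakolan/NFT-Big-Data-Analysis.git
cd NFT-Big-Data-Analysis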
Command to run file: ETL/collect_stats.py
time ${SPARK_HOME}/bin/spark-submit ETL/collect_stats.py
The program produces an HDFS folder called DAppStats.
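Assuming the folder is written to the user's HDFS home directory, the output can be checked with:
hdfs dfs -ls DAppStats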
Command to run file: ETL/getNFTs.py
time ${SPARK_HOME}/bin/spark-submit ETL/getNFTs.py
The program produces an HDFS folder called rawnftdata. The file in this folder is retrieved and converted into a JSON file called stats.json.
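If stats.json is produced manually rather than by the script itself, one way to pull the folder's contents out of HDFS into a single local file is getmerge (an assumption about the workflow, not a step confirmed by the project):
hdfs dfs -getmerge rawnftdata stats.json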
Command to run file: ETL/transformNFT.py
time ${SPARK_HOME}/bin/spark-submit ETL/transformNFT.py
The program produces a file called nfts.json
An AWS account and an Administrator user must be created before proceeding to the next step. After creation, the AWS ACCESS_ID and ACCESS_KEY should be added to the ETL folder of the project as a Python file.
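A minimal sketch of such a credentials file, assuming a hypothetical filename and variable names (match whatever names the load scripts actually import):

# ETL/aws_credentials.py (hypothetical filename)
ACCESS_ID = "AKIA..."                  # access key ID of the Administrator user
ACCESS_KEY = "your-secret-access-key"  # matching secret access key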
Command to run file: ETL/loadStats.py
time ${SPARK_HOME}/bin/spark-submit ETL/loadStats.py
After the Stats table has been created and populated:
Command to run file: ETL/loadNFT.py
time ${SPARK_HOME}/bin/spark-submit ETL/loadNFT.py
After the NFTs table has been created and populated:
Command to run file: Data_Analysis/analyse_stats.py
time ${SPARK_HOME}/bin/spark-submit Data_Analysis/analyse_stats.py
The program produces two output files, which can be found on HDFS.
dapp_volume.json - Contains the 1-day, 7-day, and 30-day volume of NFTs sold for the top 10 NFTs of each dapp.
dapp_optimality.json - Contains the optimality score of each dapp.
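Both outputs can be inspected directly on HDFS, for example (if Spark wrote them as folders of part files, append /part-* to the paths):
hdfs dfs -cat dapp_volume.json
hdfs dfs -cat dapp_optimality.json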
- Run generate_nft_per_dapp_csv.py to generate a CSV file containing the dapp names along with the number of NFTs in each dapp (see the invocation note after this list). CSV file generated: nftperdapp.csv, columns=['slug', 'NFTcount']
- Run RarityCalculator.py, which calculates the rarity of each NFT. The output is a CSV file for each dapp listed in nftperdapp.csv.
columns=['id', 'token_id', 'nft_name', 'image_url', 'slug', 'last_sale_total_price', 'rarity']
- When running RarityCalculator.py, an additional CSV called top5NFTs.csv is generated. It contains the top 5 NFTs per dapp.
columns=['slug', 'NFTcount', 'image_1', 'nft1_id', 'image_2', 'nft2_id', 'image_3', 'nft3_id', 'image_4', 'nft4_id', 'image_5', 'nft5_id']
- Run NFTPriceRegression.py, which uses linear regression to generate the predicted price of each NFT within each dapp. This step requires the rarity output generated by RarityCalculator.py in order to run.
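The three scripts above are assumed to sit in the Data_Analysis folder alongside nft_correlation_analysis.py below; if so, they can be invoked in the same way, for example:
python3 Data_Analysis/RarityCalculator.py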
python3 Data_Analysis/nft_correlation_analysis.py
The program produces a file called nft_corr.csv
Tableau visualization workbooks are located in Visualization_Tableau_Workbooks. Tableau Desktop, Tableau Public, or Tableau Reader must be downloaded to view the workbooks. Otherwise, see the full visualization here: https://public.tableau.com/app/profile/ha.do1817