Cryptolytics: CoinCap Data Extraction and Analysis Pipeline

Introduction

In today's data-driven world, data plays a pivotal role in shaping decisions within organizations. The sheer volume of data generated requires data engineers to centralize it efficiently, clean and model it to meet specific business requirements, and make it easily accessible to data consumers.

The aim of this project is to build an automated data pipeline that retrieves cryptocurrency data from the CoinCap API, processes and transforms it for analysis, and presents key metrics on a near-real-time* dashboard. The dashboard provides users with valuable insights into the dynamic cryptocurrency market.

*near-real-time because the data is loaded from the source and processed every 5 minutes rather than instantly

Dataset

The data used in this project was obtained from the CoinCap API, which provides real-time pricing and market activity for over 1,000 cryptocurrencies.

Tools & Technologies used:

  • Cloud: Google Cloud Platform (GCP)
  • Infrastructure as Code (IaC): Terraform
  • Containerization: Docker, Docker Compose
  • Workflow Orchestration: Apache Airflow
  • Data Lake: Google Cloud Storage (GCS)
  • Data Warehouse: BigQuery
  • Data Transformation: Data Build Tool (DBT)
  • Visualization: Looker Studio
  • Programming Language: Python (batch processing), SQL (data transformation)

Data Architecture

[Diagram: full data pipeline]

Project Map:

  1. Provisioning Resources: Terraform is used to set up the necessary GCP resources, including a Compute Engine instance, a GCS bucket, and BigQuery datasets.
  2. Data Extraction: Every 5 minutes, JSON data is retrieved from the CoinCap API and converted to Parquet format for optimized storage and processing.
  3. Data Loading: The converted data is stored in Google Cloud Storage (the data lake) and then loaded into BigQuery (the data warehouse).
  4. Data Transformation: DBT is connected to BigQuery to transform the raw data, and the processed data is loaded back into BigQuery. The entire ELT process is automated and orchestrated using Apache Airflow.
  5. Reporting: The transformed dataset is used to create an analytical report and visualizations in Looker Studio.
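
The extraction step (item 2) can be sketched in plain Python. This is a minimal sketch, not the project's actual code: it assumes the CoinCap assets payload has the shape {"data": [...], "timestamp": <epoch milliseconds>} with numeric values encoded as strings. The function names, the selected fields, and the object-path layout are all hypothetical. Writing the rows to Parquet is left out because it would need a library such as pandas or pyarrow.

```python
from datetime import datetime, timezone

# Fields assumed (not confirmed by this README) to arrive as numeric strings.
NUMERIC_FIELDS = ("supply", "marketCapUsd", "volumeUsd24Hr", "priceUsd", "changePercent24Hr")


def to_float(value):
    """Convert a numeric string to float; keep missing values as None."""
    return float(value) if value not in (None, "") else None


def normalize_assets(payload):
    """Flatten an assets payload into typed rows ready for columnar storage."""
    fetched_at = datetime.fromtimestamp(payload["timestamp"] / 1000, tz=timezone.utc)
    rows = []
    for asset in payload["data"]:
        row = {
            "id": asset["id"],
            "symbol": asset["symbol"],
            "name": asset["name"],
            "rank": int(asset["rank"]),
            "fetched_at": fetched_at.isoformat(),
        }
        for field in NUMERIC_FIELDS:
            row[field] = to_float(asset.get(field))
        rows.append(row)
    return rows


def object_name(fetched_at, prefix="raw/coins"):
    """Build a hypothetical data-lake object path, one file per 5-minute window."""
    minute = fetched_at.minute - fetched_at.minute % 5
    window = fetched_at.replace(minute=minute, second=0, microsecond=0)
    return f"{prefix}/{window:%Y/%m/%d}/coins_{window:%H%M}.parquet"


# A hand-written sample payload, not real market data.
sample = {
    "data": [{"id": "bitcoin", "rank": "1", "symbol": "BTC", "name": "Bitcoin",
              "supply": "19000000.0", "marketCapUsd": None, "priceUsd": "64000.5"}],
    "timestamp": 1700000000000,
}
rows = normalize_assets(sample)
```

Casting the values up front keeps the later Parquet and BigQuery schemas numeric rather than string-typed, and rounding the timestamp down to the 5-minute window gives each scheduled run a predictable file name.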

Dashboard

Disclaimer: This is only a pet project. Please do not use this dashboard for actual financial decisions. T for thanks!


How to Replicate the Data Pipeline

Below are the steps to reproduce this pipeline in the cloud. Note that Windows/WSL/Git Bash was used locally for this project.

1. Set up Google Cloud Platform (GCP)

  • If you don't have a GCP account already, create a free trial account (you get $300 in free credits) by following the steps in this guide
  • Create a new project on GCP (see guide) and take note of your Project ID, as it will be needed at later stages of the project
  • The next steps are to enable the necessary APIs for the project, create and configure a service account, and generate an auth key. While all of these can be done via the GCP web UI (see), Terraform will be used to run these processes (somebody say DevOps, hehehe), so skip them for now.
  • If you haven't already, download and install the Google Cloud SDK for local setup. You can follow this installation guide.
    • You might need to restart your system before gcloud can be used via the CLI. Check that the installation succeeded by running gcloud -v in your terminal to view the installed gcloud version
    • Run gcloud auth login to authenticate the Google Cloud SDK with your Google account
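
The installation check can also be scripted. A minimal sketch, assuming only that the executable is named gcloud and is expected on PATH; the helper names are hypothetical:

```python
import shutil
import subprocess


def cli_available(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


def gcloud_version():
    """Return the output of `gcloud --version`, or None if gcloud is not installed."""
    path = shutil.which("gcloud")
    if path is None:
        return None
    # Use the resolved path so wrapper scripts (e.g. gcloud.cmd on Windows) are found.
    result = subprocess.run([path, "--version"], capture_output=True, text=True, check=False)
    return result.stdout.strip()
```

If gcloud_version() returns None, the SDK is either not installed or not yet on PATH, which is when restarting the terminal or system usually helps.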

2. Generate the SSH Key Pair Locally

The SSH key will be used to connect to the GCP virtual machine from the local terminal (Linux). In your terminal, run the command:
ssh-keygen -t rsa -f ~/.ssh/<whatever-you-want-to-name-your-key> -C <the-username-that-you-want-on-your-VM> -b 2048

For example: ssh-keygen -t rsa -f ~/.ssh/ssh_key -C aayomide -b 2048

3. Provision the Needed GCP Resources via Terraform

Follow the Terraform how-to-reproduce guide

4. Create an SSH Connection to the newly created VM (on your local machine)

Create a file called config within the .ssh directory in your home folder and paste the following information:

Host <vm-name-to-use-when-connecting>
    # external IP: check the terraform output in the CLI or navigate to GCP > Compute Engine > VM instances
    Hostname <external-ip-address>
    # same as the gce_ssh_user
    User <username used when running the ssh-keygen command>
    IdentityFile <absolute-path-to-your-private-ssh-key-on-local-machine>
    # forward traffic from local port 8080 to port 8080 on the remote server where Airflow is running
    LocalForward 8080 localhost:8080
    # forward traffic from local port 8888 to port 8888 on the remote server where Jupyter Notebook is running
    LocalForward 8888 localhost:8888

For example:

Host cryptolytics_vm
    Hostname 35.225.33.44
    User aayomide
    IdentityFile c:/Users/aayomide/.ssh/ssh_key
    LocalForward 8080 localhost:8080
    LocalForward 8888 localhost:8888

Afterward, connect to the virtual machine via your local terminal by running ssh cryptolytics_vm.

You can also access the VM via VS Code, as shown here

Note: the external IP address changes when you stop and restart the VM instance, so update the Hostname value in the config file whenever that happens.
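
Because the external IP address can change between restarts, the config entry may need regenerating from time to time. Below is a minimal sketch of a helper that renders the entry from its parts. The function name and parameters are hypothetical, and the values reuse the example above.

```python
def render_ssh_config(host_alias, external_ip, user, identity_file, forwarded_ports=(8080, 8888)):
    """Return an SSH config entry that forwards each port to the same port on the VM."""
    lines = [
        f"Host {host_alias}",
        f"    Hostname {external_ip}",
        f"    User {user}",
        f"    IdentityFile {identity_file}",
    ]
    for port in forwarded_ports:
        lines.append(f"    LocalForward {port} localhost:{port}")
    return "\n".join(lines) + "\n"


entry = render_ssh_config(
    "cryptolytics_vm", "35.225.33.44", "aayomide", "c:/Users/aayomide/.ssh/ssh_key"
)
print(entry)
```

The printed text can be pasted into the config file in place of the old entry. Only the external_ip argument needs to change after a restart.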

5. Set Up DBT (data build tool)

Follow the dbt how-to-reproduce guide

6. Orchestrate the Dataflow with Airflow

Follow the airflow how-to-reproduce guide

7. Create a Report in Looker Studio

  • Log in to Looker Studio using your Google account
  • Click on "Blank report" and select the "BigQuery" data connector
  • Choose your data source (project -> dataset), which in this case is "prod_coins_dataset"

Further Improvements

  • Use Apache Kafka to stream the data in real time
  • Perform advanced data transformation using DBT or even PySpark
  • Implement more robust error handling with try/except blocks and write more robust data quality tests in DBT
  • Add pipeline alerting and monitoring
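
As an illustration of the error-handling item, here is a minimal sketch of a retry wrapper with exponential backoff built on try/except. It is not part of the current pipeline: with_retries and flaky_fetch are hypothetical names, and the flaky task only simulates a temporary API failure.

```python
import time


def with_retries(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `task()` until it succeeds, waiting longer after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as error:  # narrow this to the errors you expect in practice
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({error}); retrying in {delay:.2f}s")
            sleep(delay)


# Usage with a hypothetical task that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"status": "ok"}


result = with_retries(flaky_fetch, base_delay=0.01)
```

Airflow tasks also accept retries and retry_delay arguments, so in a DAG the same effect can often be had through task configuration rather than a hand-written loop.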

Resources