sukhishdhawan/data-eng-assessment

data-eng-assessment

This assessment is carried out on the GCP free tier.

Google Doc Link - https://docs.google.com/document/d/1aX0vTrG03R84NLSzyaQMzQtkMgKgRGPkdeCWAuPqF38

Data Source - https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

  1. Prerequisites -
  • Set up Airflow (Google Cloud Composer). (Screenshot 2024-09-10 at 3 24 48 AM)

  • Create a Dataproc cluster. (Screenshot 2024-09-10 at 3 27 07 AM)

  2. Setting Up the Airflow DAG -
  • Upload all the files and subfolders in the dags directory to the Airflow DAGs folder so that Composer can pick up the DAG. Also replace trusty-drive-434711-g9 with your own project_id wherever it appears in these files. (Screenshot 2024-09-10 at 3 33 10 AM)

  • Once the DAG has been picked up successfully, it should look something like this - (Screenshot 2024-09-10 at 3 36 12 AM)

  • When you trigger the DAG for a particular date, say 2023-06-05, it picks up the source files for that date's month (June 2023) and loads them into the final BigQuery table trip_data_consolidated.
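
The date-to-month step described above can be sketched as follows. This is a minimal illustration, not the repository's DAG code: the base URL and file naming are assumed from the public TLC download links, and `TAXI_TYPES`, `month_key`, and `source_urls` are hypothetical names introduced here.

```python
from datetime import date

# Assumption: the public TLC parquet files follow this naming pattern.
# The repository's DAG may build its source paths differently.
TLC_BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"
TAXI_TYPES = ("yellow", "green")  # hypothetical subset of the TLC datasets


def month_key(run_date: date) -> str:
    """Return the YYYY-MM key of the month that contains run_date."""
    return f"{run_date.year:04d}-{run_date.month:02d}"


def source_urls(run_date: date, taxi_types=TAXI_TYPES) -> list[str]:
    """List one parquet URL per taxi type for the run date's month."""
    key = month_key(run_date)
    return [f"{TLC_BASE_URL}/{t}_tripdata_{key}.parquet" for t in taxi_types]


if __name__ == "__main__":
    # A run for 2023-06-05 resolves to the June 2023 files.
    for url in source_urls(date(2023, 6, 5)):
        print(url)
```

Any day within a month maps to the same monthly files, so triggering the DAG for 2023-06-05 or 2023-06-20 would target the same source data under this sketch.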

  3. Analysis on the dataset -
  • You can analyze the dataset along the lines of the notebook /analysis/Exploratory_Data_Analysis.ipynb.
  • The Jupyter notebook can be opened from the Dataproc cluster. (Screenshot 2024-09-10 at 3 42 46 AM)
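
As a starting point for this kind of analysis, the sketch below builds a BigQuery SQL string that counts trips per day for the month of a given run date. It is not taken from the notebook: only the table name trip_data_consolidated comes from this README, while the dataset name `trips`, the column `pickup_datetime`, and the function names are assumptions to replace with your own.

```python
from datetime import date

# Hypothetical fully qualified table; replace project and dataset with yours.
DEFAULT_TABLE = "your-project-id.trips.trip_data_consolidated"


def next_month_start(d: date) -> date:
    """Return the first day of the month following d."""
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def daily_trip_count_sql(
    run_date: date,
    table: str = DEFAULT_TABLE,
    ts_column: str = "pickup_datetime",  # assumed column name
) -> str:
    """Build a query counting trips per day in run_date's month."""
    start = run_date.replace(day=1)
    end = next_month_start(start)
    # Half-open range [start, end) covers the whole month.
    return (
        f"SELECT DATE({ts_column}) AS trip_date, COUNT(*) AS trips\n"
        f"FROM `{table}`\n"
        f"WHERE {ts_column} >= '{start.isoformat()}'\n"
        f"  AND {ts_column} < '{end.isoformat()}'\n"
        "GROUP BY trip_date\n"
        "ORDER BY trip_date"
    )


if __name__ == "__main__":
    print(daily_trip_count_sql(date(2023, 6, 5)))
```

The resulting string can be pasted into the BigQuery console or passed to a BigQuery client from the notebook.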
  4. Similar services in Azure for the same tasks -
  • Google Cloud Storage (GCS) ≈ Azure Blob Storage
  • Google Cloud Composer ≈ Azure Data Factory
  • Google Dataproc ≈ Azure Synapse Analytics

These equivalent Azure services can be used to perform the same tasks as in GCP. To set up Airflow in Azure, follow this tutorial - https://www.youtube.com/watch?v=pGZ5v7OMqhM

About

Senior Data Engineer Hands-On Assessment Assignment
