This assessment is done on the GCP free tier.
Google Doc Link - https://docs.google.com/document/d/1aX0vTrG03R84NLSzyaQMzQtkMgKgRGPkdeCWAuPqF38
Data Source - https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
## Prerequisites

## Setting up the Airflow DAG
Now, upload all the files and subfolders in the `dags` directory to the Airflow DAGs folder so that Composer can pick up our DAG. Also change `trusty-drive-434711-g9` to your project ID wherever it appears in the files.
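The project-ID swap can be scripted instead of done by hand. A minimal sketch, assuming the placeholder ID from this repo and plain-text DAG files (the helper function and its name are illustrative, not part of the repo):

```python
from pathlib import Path

OLD_PROJECT_ID = "trusty-drive-434711-g9"  # placeholder used in this repo
NEW_PROJECT_ID = "your-project-id"         # replace with your own GCP project ID

def replace_project_id(dags_dir: str) -> int:
    """Rewrite every file under dags_dir, swapping the placeholder
    project ID for your own. Returns the number of files changed."""
    changed = 0
    for path in Path(dags_dir).rglob("*"):
        if not path.is_file():
            continue
        text = path.read_text()
        if OLD_PROJECT_ID in text:
            path.write_text(text.replace(OLD_PROJECT_ID, NEW_PROJECT_ID))
            changed += 1
    return changed
```

After the swap, the folder can be uploaded to the Composer environment's bucket, e.g. `gsutil -m cp -r dags/* gs://<composer-bucket>/dags/`.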
After the DAG is picked up successfully, it should be visible in the Airflow UI.
Once you trigger the DAG for a particular date, say `2023-06-05`, it will pick up the source files for June 2023 and load them into the final BigQuery table `trip_data_consolidated`.
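The date-to-file mapping can be sketched as below. The file-name pattern follows the monthly parquet files on the TLC page; the base URL and the `yellow_tripdata` dataset name are assumptions here, so verify them against the data-source link above:

```python
from datetime import date

# Assumed base URL for the TLC monthly parquet files; check the
# data-source page above before relying on it.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"

def source_file_for(run_date: date, dataset: str = "yellow_tripdata") -> str:
    """Map a DAG run date to the monthly source file the pipeline loads."""
    return f"{BASE_URL}/{dataset}_{run_date:%Y-%m}.parquet"
```

For example, a run triggered for `2023-06-05` resolves to the `yellow_tripdata_2023-06.parquet` file.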
## Analysis on the dataset
- You can analyze the dataset in Jupyter, similar to the analysis done in the notebook `/analysis/Exploratory_Data_Analysis.ipynb`.
- Jupyter can be accessed on the Dataproc cluster.
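The flavor of the notebook's exploratory analysis can be reproduced even without the cluster. A hypothetical sketch over a few in-memory trip records; the column names are assumptions for illustration, not necessarily the notebook's schema:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trip records; the real notebook reads from the
# trip_data_consolidated BigQuery table instead.
trips = [
    {"pickup_day": "2023-06-01", "trip_distance": 2.1, "fare_amount": 11.5},
    {"pickup_day": "2023-06-01", "trip_distance": 5.4, "fare_amount": 22.0},
    {"pickup_day": "2023-06-02", "trip_distance": 1.2, "fare_amount": 8.0},
]

def avg_fare_per_day(records):
    """Group trips by pickup day and average the fare, EDA-style."""
    by_day = defaultdict(list)
    for r in records:
        by_day[r["pickup_day"]].append(r["fare_amount"])
    return {day: mean(fares) for day, fares in by_day.items()}
```

The same group-and-aggregate shape maps directly onto a `GROUP BY` in BigQuery or a `groupBy` in a notebook dataframe.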
## Similar services in Azure for the same tasks
- Google Cloud Storage (GCS) ≈ Azure Blob Storage
- Google Composer ≈ Azure Data Factory
- Google Dataproc ≈ Azure Synapse Analytics
We can use these similar services in Azure to do the same tasks we did in GCP. To set up Airflow in Azure, follow this tutorial - https://www.youtube.com/watch?v=pGZ5v7OMqhM