1. Azure Storage
2. Azure Event Hubs
3. Azure HDInsight Kafka
4. Azure SQL Database
5. Azure SQL Data Warehouse
6. Azure Cosmos DB
7. Azure Data Factory v2
8. Azure Key Vault
This module covers a simple data engineering batch pipeline built with Spark in Scala.
We will use Azure Data Factory v2 to copy the source data to a staging directory in Azure Blob Storage. The instructions are here.
We will read the six raw CSV reference datasets in the staging directory (blob storage) and persist them to Parquet in the curated information zone.
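The reference-data load can be sketched as a simple read-CSV/write-Parquet loop. This is a minimal sketch only: the storage account, container names, and the six dataset names below are illustrative assumptions, not the workshop's actual values.

```scala
import org.apache.spark.sql.SparkSession

object LoadReferenceData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LoadReferenceData").getOrCreate()

    // Hypothetical staging and curated locations; substitute your own account/containers
    val stagingDir = "wasbs://staging@<storage-account>.blob.core.windows.net/reference-data"
    val curatedDir = "wasbs://curated@<storage-account>.blob.core.windows.net/reference-data"

    // Hypothetical names for the six reference datasets
    val datasets = Seq("taxi-zone-lookup", "rate-code", "payment-type",
                       "trip-type", "trip-month", "vendor")

    datasets.foreach { name =>
      spark.read
        .option("header", "true")       // raw CSVs include a header row
        .option("inferSchema", "true")  // acceptable for small reference tables
        .csv(s"$stagingDir/$name.csv")
        .write
        .mode("overwrite")
        .parquet(s"$curatedDir/$name")
    }
  }
}
```

Inferring the schema is fine for small lookup tables; for the larger transactional datasets an explicit schema is the better choice.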
We will read the raw CSV trip data in the staging directory (blob storage) and persist it in Delta format to the raw information zone. We will dedupe the data and add some additional columns as a precursor to homogenizing the schema across yellow and green taxi trips and across years.
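One transactional load (e.g. yellow taxi trips for a single year) can be sketched as below. The paths, the `taxi_type`/`trip_year` column names, and the year value are illustrative assumptions; writing Delta format also requires the Delta Lake library on the cluster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object LoadTripData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LoadTripData").getOrCreate()

    val taxiType = "yellow" // or "green"
    val tripYear = 2017     // illustrative year
    // Hypothetical staging and raw-zone locations
    val stagingPath  = s"wasbs://staging@<storage-account>.blob.core.windows.net/trips/$taxiType/$tripYear"
    val rawDeltaPath = s"wasbs://raw@<storage-account>.blob.core.windows.net/trips/$taxiType"

    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(stagingPath)
      .dropDuplicates()                        // dedupe the raw records
      .withColumn("taxi_type", lit(taxiType))  // precursor columns for the
      .withColumn("trip_year", lit(tripYear))  // homogenized yellow/green schema
      .write
      .format("delta")
      .mode("append")
      .partitionBy("trip_year")
      .save(rawDeltaPath)
  }
}
```

Partitioning by the added `trip_year` column keeps per-year reloads cheap and lets the yellow and green loads run as independent jobs.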
Steps 2.2.a/b/c can be run in parallel.