
Lab guide

Module 1: Azure Data Services - Integration Primer

1. Azure Storage
2. Azure Event Hubs
3. Azure HDInsight Kafka
4. Azure SQL Database
5. Azure SQL Data Warehouse
6. Azure Cosmos DB
7. Azure Data Factory v2
8. Azure Key Vault

Module 2: Data Engineering - Primer

This module covers a simple batch data engineering pipeline implemented in Spark with Scala.

2.1. Data copy:

We will use Azure Data Factory v2 to copy data to a staging directory in Azure Blob Storage. The instructions are here.

2.2. Load/parse/persist raw data:

2.2.a. Load reference data:

We will read the raw CSV reference datasets (six of them) from the staging directory (blob storage) and persist them to Parquet in the curated information zone.
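Below is a minimal Spark Scala sketch of this step. The storage account, container names, paths, and the specific reference dataset names are assumptions for illustration; substitute the values from your own environment.

```scala
import org.apache.spark.sql.SparkSession

// A sketch of 2.2.a: read each CSV reference dataset and persist it as Parquet.
def loadReferenceData(spark: SparkSession): Unit = {
  // Hypothetical locations; substitute your storage account and containers.
  val stagingRoot = "wasbs://staging@youraccount.blob.core.windows.net"
  val curatedRoot = "wasbs://curated@youraccount.blob.core.windows.net"

  // Dataset names are assumed for illustration (NYC taxi lookup data).
  val refDatasets = Seq("taxi-zone", "rate-code", "payment-type",
                        "trip-type", "trip-month", "vendor")

  refDatasets.foreach { name =>
    spark.read
      .option("header", "true")      // the CSVs carry a header row
      .option("inferSchema", "true") // acceptable for small reference datasets
      .csv(s"$stagingRoot/reference-data/$name")
      .write
      .mode("overwrite")
      .parquet(s"$curatedRoot/reference-data/$name")
  }
}
```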

2.2.b. Load transactional yellow taxi trip data:

We will read the raw CSV data from the staging directory (blob storage) and persist it to Delta format in the raw information zone. We will dedupe the data and add some additional columns as a precursor to homogenizing the schema across yellow and green taxi trips and across years. A combined sketch covering both 2.2.b and 2.2.c follows the next section.

2.2.c. Load transactional green taxi trip data:

We will process the green taxi trip data the same way: read the raw CSV data from the staging directory (blob storage), dedupe it, add the schema-homogenizing columns, and persist it to Delta format in the raw information zone.
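Since 2.2.b and 2.2.c differ only in the dataset, a single parameterized sketch can cover both. The paths, the pickup-timestamp column names, and the derived columns here are assumptions for illustration; the lab notebooks define the actual homogenized schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, month, year}

// A sketch of 2.2.b/2.2.c: read CSV trip data, dedupe, add columns, write Delta.
def loadTripData(spark: SparkSession, taxiType: String): Unit = {
  // Hypothetical locations; substitute your storage account and containers.
  val stagingRoot = "wasbs://staging@youraccount.blob.core.windows.net"
  val rawRoot     = "wasbs://raw@youraccount.blob.core.windows.net"

  val raw = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(s"$stagingRoot/transactional-data/$taxiType-taxi")

  // Yellow and green trip data name the pickup timestamp differently (assumed).
  val pickupCol = if (taxiType == "yellow") "tpep_pickup_datetime"
                  else "lpep_pickup_datetime"

  raw.dropDuplicates()                               // dedupe the raw records
    .withColumn("taxi_type", lit(taxiType))          // precursor to a common schema
    .withColumn("trip_year", year(col(pickupCol)))   // enables cross-year queries
    .withColumn("trip_month", month(col(pickupCol)))
    .write
    .mode("overwrite")
    .partitionBy("trip_year", "trip_month")
    .format("delta")                                 // persist in Delta format
    .save(s"$rawRoot/transactional-data/$taxiType-taxi")
}
```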

Steps 2.2.a, 2.2.b, and 2.2.c can be run in parallel, as sketched below.
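One way to run the three loads concurrently from a single driver is with Scala Futures, assuming the helper functions sketched above and a SparkSession named `spark` (e.g. the one a Databricks notebook provides); Spark's scheduler runs the resulting jobs concurrently on the shared cluster.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Kick off all three loads; each Future submits its own Spark jobs.
val jobs = Seq(
  Future { loadReferenceData(spark) },      // 2.2.a
  Future { loadTripData(spark, "yellow") }, // 2.2.b
  Future { loadTripData(spark, "green") }   // 2.2.c
)

// Block until all three loads have completed.
Await.result(Future.sequence(jobs), Duration.Inf)
```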