1. Azure Storage
2. Azure Event Hubs
3. Azure HDInsight Kafka
4. Azure SQL Database
5. Azure SQL Data Warehouse
6. Azure Cosmos DB
7. Azure Data Factory v2
8. Azure Key Vault
This module covers a simple data engineering batch pipeline built with Spark in Scala.
We will use Azure Data Factory v2 to copy the source data to a staging directory in Azure Blob Storage. The instructions are here.
We will read the six raw CSV reference datasets in the staging directory (blob storage) and persist them to Parquet in the curated information zone.
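The reference-data load can be sketched as a simple read-CSV/write-Parquet loop. This is a minimal sketch only: the storage account, container names, and the six dataset names below are illustrative assumptions, not the workshop's actual values.

```scala
import org.apache.spark.sql.SparkSession

object LoadReferenceData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LoadReferenceData").getOrCreate()

    // Hypothetical staging and curated locations; substitute your own account/containers
    val stagingDir = "wasbs://staging@<storage-account>.blob.core.windows.net/reference-data"
    val curatedDir = "wasbs://curated@<storage-account>.blob.core.windows.net/reference-data"

    // Hypothetical names for the six reference datasets
    val datasets = Seq("taxi-zone-lookup", "rate-code", "payment-type",
                       "trip-type", "trip-month", "vendor")

    datasets.foreach { name =>
      spark.read
        .option("header", "true")       // raw CSVs include a header row
        .option("inferSchema", "true")  // acceptable for small reference tables
        .csv(s"$stagingDir/$name.csv")
        .write
        .mode("overwrite")
        .parquet(s"$curatedDir/$name")
    }
  }
}
```

Inferring the schema is fine for small lookup tables; for the larger transactional datasets an explicit schema is the better choice.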
We will read the raw CSV trip data in the staging directory (blob storage) and persist it in Delta format to the raw information zone. We will dedupe the data and add some additional columns as a precursor to homogenizing the schema across yellow and green taxi trips and across years.
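One transactional load (e.g. yellow taxi trips for a single year) can be sketched as below. The paths, the `taxi_type`/`trip_year` column names, and the year value are illustrative assumptions; writing Delta format also requires the Delta Lake library on the cluster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object LoadTripData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LoadTripData").getOrCreate()

    val taxiType = "yellow" // or "green"
    val tripYear = 2017     // illustrative year
    // Hypothetical staging and raw-zone locations
    val stagingPath  = s"wasbs://staging@<storage-account>.blob.core.windows.net/trips/$taxiType/$tripYear"
    val rawDeltaPath = s"wasbs://raw@<storage-account>.blob.core.windows.net/trips/$taxiType"

    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(stagingPath)
      .dropDuplicates()                        // dedupe the raw records
      .withColumn("taxi_type", lit(taxiType))  // precursor columns for the
      .withColumn("trip_year", lit(tripYear))  // homogenized yellow/green schema
      .write
      .format("delta")
      .mode("append")
      .partitionBy("trip_year")
      .save(rawDeltaPath)
  }
}
```

Partitioning by the added `trip_year` column keeps per-year reloads cheap and lets the yellow and green loads run as independent jobs.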
Steps 2.2.a/b/c can be run in parallel.