- This Udacity Nanodegree teaches how to model data for analytics at scale
- The following concepts are taught:
- Data Modelling
- ETL with PostgreSQL and Cassandra
- Amazon Web Services set-up: IAM, S3, Redshift, EMR instances
- Data pipelining with Spark
- Airflow
- After each lesson, the student builds a project demonstrating their understanding of the material
- This repository contains my own solutions to those projects
- During this course, the student will build solutions for Sparkify, a fictional music streaming start-up.
- The data used is based on the Million Song Dataset
- The student will use different techniques to prepare that data for an analytics-ready dashboard
- Understand the purpose of data modeling
- Identify the strengths and weaknesses of different types of databases and data storage techniques
- Understand when to use a relational database
- Understand the difference between OLAP and OLTP databases
- Create normalized data tables (3NF)
- Implement denormalized schemas (e.g. Star, Snowflake)
- Understand when to use NoSQL databases and how they differ from relational databases
- Create a table for a given use case. Select the appropriate primary key and clustering columns
- Create a NoSQL database in Apache Cassandra
- Understand Data Warehousing architecture
- Run an ETL process to denormalize a database (3NF to Star)
- Create an OLAP cube from facts and dimensions
- Compare columnar vs. row-oriented approaches
- Understand cloud computing
- Create an AWS account and understand its services
- Set up Amazon S3, IAM, VPC, EC2, RDS PostgreSQL
- Identify components of the Redshift architecture
- Run an ETL process to extract data from S3 into Redshift
- Set up AWS infrastructure using Infrastructure as Code (IaC)
- Design an optimized table by selecting the appropriate distribution style and sorting key
- Use Spark to run code
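
As an illustration of the "3NF to Star" objective listed above, here is a minimal sketch using Python's built-in `sqlite3` module rather than the PostgreSQL or Redshift set-up used in the course. All table names, column names, and sample rows are hypothetical and are not taken from the actual Sparkify projects; they only show the general idea of folding normalized tables into one fact table and a few dimension tables.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized (3NF) source tables. Names and columns are hypothetical.
cur.executescript("""
CREATE TABLE artists (artist_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE songs (
    song_id TEXT PRIMARY KEY,
    title TEXT,
    artist_id TEXT REFERENCES artists(artist_id)
);
CREATE TABLE plays (
    play_id INTEGER PRIMARY KEY,
    song_id TEXT REFERENCES songs(song_id),
    user_id INTEGER,
    played_at TEXT
);
""")

# Made-up sample rows, for illustration only.
cur.executemany("INSERT INTO artists VALUES (?, ?)",
                [("A1", "Artist One"), ("A2", "Artist Two")])
cur.executemany("INSERT INTO songs VALUES (?, ?, ?)",
                [("S1", "Song One", "A1"), ("S2", "Song Two", "A2")])
cur.executemany("INSERT INTO plays VALUES (?, ?, ?, ?)",
                [(1, "S1", 10, "2018-11-01T10:00:00"),
                 (2, "S1", 11, "2018-11-01T11:00:00"),
                 (3, "S2", 10, "2018-11-02T09:30:00")])

# Star schema: one fact table surrounded by denormalized dimensions.
cur.executescript("""
CREATE TABLE dim_song (song_id TEXT PRIMARY KEY, title TEXT, artist_name TEXT);
CREATE TABLE dim_date (date_key TEXT PRIMARY KEY, year INTEGER, month INTEGER, day INTEGER);
CREATE TABLE fact_songplay (
    play_id INTEGER PRIMARY KEY,
    song_id TEXT,
    user_id INTEGER,
    date_key TEXT
);

-- The artist name is folded into the song dimension (denormalization).
INSERT INTO dim_song
SELECT s.song_id, s.title, a.name
FROM songs s JOIN artists a ON a.artist_id = s.artist_id;

-- One row per distinct calendar day on which a play happened.
INSERT INTO dim_date
SELECT DISTINCT date(played_at),
       CAST(strftime('%Y', played_at) AS INTEGER),
       CAST(strftime('%m', played_at) AS INTEGER),
       CAST(strftime('%d', played_at) AS INTEGER)
FROM plays;

-- The fact table keeps only keys pointing at the dimensions.
INSERT INTO fact_songplay
SELECT play_id, song_id, user_id, date(played_at) FROM plays;
""")

# An analytical query now needs a single join from fact to dimension.
plays_per_artist = cur.execute("""
SELECT d.artist_name, COUNT(*)
FROM fact_songplay f JOIN dim_song d ON d.song_id = f.song_id
GROUP BY d.artist_name
ORDER BY d.artist_name
""").fetchall()
print(plays_per_artist)
```

The trade-off shown here is the usual one: the artist name is duplicated in `dim_song`, but analytical queries avoid the extra join through `artists`.
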
In addition to the content provided by the course, I did my own research to come up with solutions.
I also drew inspiration from other students of the Udacity Data Engineering Nanodegree, whom I credit below.
Disclaimer: I did not copy-paste their code; I compared my solutions with theirs and improved mine where theirs was better.
- Naresh Kumar
- Florencia Silvestre
- Sanchit Kumar
- Tran Nguyen
- And many thanks to the students on the Udacity Chat