This project builds a pipeline in Apache Airflow that extracts the song and event datasets (JSON files) from an S3 bucket, transforms them, and loads the following tables into a Redshift database, running a data quality check (DQC) after the load:
- staging_events
- staging_songs
- songplays (fact table)
- users (dimension table)
- songs (dimension table)
- artists (dimension table)
- time (timestamps broken down into comprehensible columns)
For illustration purposes, you can check out the graph that represents this pipeline's flow (Database Schema). In short, the pipeline:
- stages the raw data;
- transforms the raw data into the songplays fact table;
- transforms the raw data into the dimension tables;
- finally, checks that each fact/dimension table has at least one row (see the DAG sketch below).
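The snippet below is a minimal sketch of that task flow, assuming Airflow 2.4+. `EmptyOperator` stands in for this project's custom staging, load, and data quality operators, and the DAG id and task ids are illustrative placeholders, not the project's exact names:

```python
# Minimal sketch of the pipeline's task dependencies (placeholders, not the real operators).
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # available in Airflow 2.3+

with DAG(
    dag_id="sparkify_etl_sketch",      # illustrative DAG id
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    stage_events = EmptyOperator(task_id="stage_events")    # S3 -> staging_events
    stage_songs = EmptyOperator(task_id="stage_songs")       # S3 -> staging_songs
    load_songplays = EmptyOperator(task_id="load_songplays_fact_table")
    load_dimensions = [
        EmptyOperator(task_id=f"load_{table}_dim_table")
        for table in ("users", "songs", "artists", "time")
    ]
    run_quality_checks = EmptyOperator(task_id="run_data_quality_checks")

    # staging tables -> fact table -> dimension tables -> data quality check
    [stage_events, stage_songs] >> load_songplays
    load_songplays >> load_dimensions
    load_dimensions >> run_quality_checks
```

In the real DAG each placeholder is replaced by the corresponding custom operator; only the dependency chain is the point of the sketch.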
Here, we'll use Airflow's UI to configure the AWS credentials and the connection to Redshift.
- Go to the Airflow UI.
- Click on the Admin tab and select Connections.
- Under Connections, select Create.
- On the create connection page, enter the following values:
Conn Id: <Enter aws_credentials>
Conn Type: <Enter Amazon Web Services>
Login: <Enter your AWS access key ID>
Password: <Enter your AWS secret access key>
Once you've entered these values, select Save and Add Another.
On the next create connection page, enter the following values:
Conn Id: <Enter redshift>
Conn Type: <Enter Postgres>
Host: <Enter the endpoint of your Redshift cluster>
Schema: <Enter dev. This is the Redshift database you want to connect to>
Login: <Enter awsuser>
Password: <Enter the password>
Port: <Enter 5439>
Once you've entered these values, select Save.
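As a quick sanity check that the redshift Conn Id resolves correctly, here is a minimal sketch (not this project's operator code) of the row-count data quality check described above, using `PostgresHook` from the Postgres provider package:

```python
# Minimal sketch of a "has at least one row" check against the redshift connection.
# Requires the apache-airflow-providers-postgres package.
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_tables_have_rows(tables=("songplays", "users", "songs", "artists", "time")):
    """Raise if any of the target tables is empty."""
    redshift = PostgresHook(postgres_conn_id="redshift")  # the Conn Id created above
    for table in tables:
        records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
        if not records or not records[0] or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {table} returned no rows")
        print(f"Data quality check passed: {table} has {records[0][0]} rows")
```

In the pipeline itself this logic lives inside the data quality task, but the same `postgres_conn_id="redshift"` lookup is what ties the DAG to the connection you just configured.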
For this project, you'll be working with two datasets. Here are the S3 links for each:
- Log data: s3://udacity-dend/log_data
- Song data: s3://udacity-dend/song_data
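If you want to peek at the raw JSON files before running the pipeline, a short boto3 listing works; this sketch assumes your environment already has AWS credentials with read access and that the bucket lives in us-west-2:

```python
# List a handful of objects under each dataset prefix in the udacity-dend bucket.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")  # assumed bucket region
for prefix in ("log_data", "song_data"):
    response = s3.list_objects_v2(Bucket="udacity-dend", Prefix=prefix, MaxKeys=5)
    for obj in response.get("Contents", []):
        print(obj["Key"])
```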
Aditya Dhanraj - LinkedIn Profile.