This folder contains the program files needed to build the Sparkify Parquet files that the analytics team reads.
- data
  - Folder containing sample data.
- etl.ipynb
  - Jupyter notebook used to build the ETL process step by step.
- etl.py
  - Reads and processes the song_data and log_data files and loads them into Parquet files.
- aws
  - Folder containing the configuration details used to create a Spark session on demand (see the sketch after this list).
- README.md
- environment.yaml
  - Conda environment file used to recreate the Python environment for the project.
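As a rough illustration of how the configuration in the aws folder can be used, the sketch below loads AWS credentials from a config file and creates a Spark session with the S3A connector available. The file name dl.cfg, its section and key names, and the hadoop-aws version are assumptions for illustration; check the aws folder and etl.py for the actual layout.

```python
import configparser
import os

from pyspark.sql import SparkSession

# Hypothetical config file name and keys; the actual layout in the
# aws folder may differ.
config = configparser.ConfigParser()
config.read("aws/dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]


def create_spark_session():
    """Create (or reuse) a Spark session with the S3A connector loaded."""
    return (
        SparkSession.builder
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
        .getOrCreate()
    )


spark = create_spark_session()
```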
Use the following command to clone the project repository:

    git clone https://github.com/shilpamadini/DWH-S3-Spark.git

Create the conda environment:

    conda env create -f environment.yaml

Activate the environment:

    source activate dand_py3

Navigate to the project directory and run the following to build the Parquet files:

    python etl.py
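To sanity-check the output, you can read one of the generated Parquet tables back into Spark. The path below is a placeholder; substitute the S3 bucket (or local path) that etl.py writes to.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path; substitute the output location configured in etl.py.
songs = spark.read.parquet("s3a://your-output-bucket/songs/")
songs.printSchema()
print(songs.count())
```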
A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3: a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in the app.
This project builds an ETL pipeline that extracts the data from S3, processes it with Spark, and transforms it into a set of dimensional tables for the analytics team to continue finding insights in the data. These tables are loaded as Parquet files into an S3 bucket for the analytics team to access.
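The core of the pipeline follows the usual Spark pattern: read the JSON sources, select and deduplicate the columns for each table, and write Parquet. The sketch below shows the shape of that flow for the song table; the input path and column names are assumptions based on the standard song dataset, so check etl.py for the exact details.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder input path; the real bucket is configured in etl.py.
song_data = "s3a://your-input-bucket/song_data/*/*/*/*.json"

# Extract: read the raw JSON song metadata from S3.
df = spark.read.json(song_data)

# Transform: pick out the columns for the song dimension table and
# drop duplicate song records.
songs_table = (
    df.select("song_id", "title", "artist_id", "year", "duration")
      .dropDuplicates(["song_id"])
)
```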
The analytics team is interested in what songs users are listening to: ranking and aggregating the data to determine which song is played the most, which songs are most popular, and which artists released the most popular songs. The team may also be interested in trends over a period of time.
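For example, once the Parquet tables are in place, a query along the following lines would rank songs by play count. The table and column names match the schema described below but are illustrative; verify them against etl.py.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
output = "s3a://your-output-bucket/"  # placeholder output location

# Load the fact and dimension tables and register them for SQL.
spark.read.parquet(output + "songplays/").createOrReplaceTempView("songplays")
spark.read.parquet(output + "songs/").createOrReplaceTempView("songs")

# Rank songs by how often they were played.
top_songs = spark.sql("""
    SELECT s.title, COUNT(*) AS play_count
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    GROUP BY s.title
    ORDER BY play_count DESC
    LIMIT 10
""")
top_songs.show()
```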
To support the required analytics, the data warehouse uses a star schema: the songplay table is the fact table, and the song, user, artist, and time tables are the dimension tables.
Here is the ER diagram explaining the schema design.
Song data is partitioned by year and artist_id. Time data is partitioned by year and month. Songplay data is partitioned by year and month.
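In PySpark, that partitioning corresponds to partitionBy calls on the DataFrame writers, along the lines of the sketch below. It continues from the songs_table sketch above; time_table and songplays_table are assumed to be built analogously in etl.py, and the output path is a placeholder.

```python
# Continuing from the songs_table sketch above; time_table and
# songplays_table are assumed to be built analogously in etl.py.
output = "s3a://your-output-bucket/"  # placeholder output location

songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet(output + "songs/")

time_table.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet(output + "time/")

songplays_table.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet(output + "songplays/")
```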