This folder contains the program files needed to build the Sparkify Parquet files that the analytics team reads.
- data
  - Folder containing sample data.
- etl.ipynb
  - Jupyter notebook used to build the ETL process step by step.
- etl.py
  - Reads and processes the song_data and log_data files and loads them into Parquet files.
- aws
  - Folder containing the configuration details used to create a Spark session on demand (see the sketch after this list).
- README.md
- environment.yaml
  - Conda environment file used to recreate the Python environment for the project.
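As a rough illustration of how the configuration in the aws folder can be used, the sketch below loads AWS credentials from a config file and creates a Spark session with the S3A connector available. The file name dl.cfg, its section and key names, and the hadoop-aws version are assumptions for illustration; check the aws folder and etl.py for the actual layout.

```python
import configparser
import os

from pyspark.sql import SparkSession

# Hypothetical config file name and keys; the actual layout in the
# aws folder may differ.
config = configparser.ConfigParser()
config.read("aws/dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]


def create_spark_session():
    """Create (or reuse) a Spark session with the S3A connector loaded."""
    return (
        SparkSession.builder
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
        .getOrCreate()
    )


spark = create_spark_session()
```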
Use the following command to clone the project repository:

    git clone https://github.com/shilpamadini/DWH-S3-Spark.git

Create the conda environment:

    conda env create -f environment.yaml

Activate the environment:

    source activate dand_py3

Navigate to the project directory and run the following to build the Parquet files:

    python etl.py
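To sanity-check the output, you can read one of the generated Parquet tables back into Spark. The path below is a placeholder; substitute the S3 bucket (or local path) that etl.py writes to.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path; substitute the output location configured in etl.py.
songs = spark.read.parquet("s3a://your-output-bucket/songs/")
songs.printSchema()
print(songs.count())
```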
A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3: a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in the app.
This project builds an ETL pipeline that extracts the data from S3, processes it with Spark, and transforms it into a set of dimensional tables for the analytics team to continue finding insights in the data. These tables are loaded as Parquet files into an S3 bucket for the analytics team to access.
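The core of the pipeline follows the usual Spark pattern: read the JSON sources, select and deduplicate the columns for each table, and write Parquet. The sketch below shows the shape of that flow for the song table; the input path and column names are assumptions based on the standard song dataset, so check etl.py for the exact details.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder input path; the real bucket is configured in etl.py.
song_data = "s3a://your-input-bucket/song_data/*/*/*/*.json"

# Extract: read the raw JSON song metadata from S3.
df = spark.read.json(song_data)

# Transform: pick out the columns for the song dimension table and
# drop duplicate song records.
songs_table = (
    df.select("song_id", "title", "artist_id", "year", "duration")
      .dropDuplicates(["song_id"])
)
```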
The analytics team is interested in what songs users are listening to: ranking and aggregating the data to determine which song is played the most, which songs are most popular, and which artists released the most popular songs. The team may also be interested in trends over a period of time.
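For example, once the Parquet tables are in place, a query along the following lines would rank songs by play count. The table and column names match the schema described below but are illustrative; verify them against etl.py.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
output = "s3a://your-output-bucket/"  # placeholder output location

# Load the fact and dimension tables and register them for SQL.
spark.read.parquet(output + "songplays/").createOrReplaceTempView("songplays")
spark.read.parquet(output + "songs/").createOrReplaceTempView("songs")

# Rank songs by how often they were played.
top_songs = spark.sql("""
    SELECT s.title, COUNT(*) AS play_count
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    GROUP BY s.title
    ORDER BY play_count DESC
    LIMIT 10
""")
top_songs.show()
```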
To support the required analytics, the data warehouse uses a star schema: the songplay table is the fact table, and the song, user, artist, and time tables are the dimension tables.
Here is the ER diagram explaining the schema design.
Song data is partitioned by year and artist_id. Time data is partitioned by year and month. Songplay data is partitioned by year and month.
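In PySpark, that partitioning corresponds to partitionBy calls on the DataFrame writers, along the lines of the sketch below. It continues from the songs_table sketch above; time_table and songplays_table are assumed to be built analogously in etl.py, and the output path is a placeholder.

```python
# Continuing from the songs_table sketch above; time_table and
# songplays_table are assumed to be built analogously in etl.py.
output = "s3a://your-output-bucket/"  # placeholder output location

songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet(output + "songs/")

time_table.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet(output + "time/")

songplays_table.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet(output + "songplays/")
```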