Sparkify Data Modeling with Postgres

Introduction

This repository contains the data modeling and ETL pipeline for Sparkify, a music streaming startup. The goal of this database is to support Sparkify's analytical needs by storing and organizing their data in a way that allows for efficient querying and analysis.

Running the Python Scripts

To run the Python scripts, you will need to have Python and Postgres installed on your machine. You will also need to create a database named sparkifydb. Once these prerequisites are met, you can run the following command to create the necessary tables and load the data:

python create_tables.py
python etl.py

Repository Contents

create_tables.py: This script is used to create the necessary tables in the sparkifydb database.
etl.py: This script is used to extract data from JSON logs and song data files, transform the data, and load it into the appropriate tables in the sparkifydb database.
sql_queries.py: This script contains the SQL queries used in the create_tables.py and etl.py scripts.
data/: This directory in data.zip contains the JSON logs and song data files that are used as input for the ETL pipeline.
test.ipynb: Jupyter notebook that contains test case to check if the data is loaded into tables correctly.
ERD.py: This script creates the ERD for the sparkifydb database.

Database Schema Design and ETL Pipeline

The database schema consists of a star schema with one fact table (songplays) and four dimension tables (users, songs, artists, time). The fact table, songplays, contains information about each song play and references the primary keys of the dimension tables. This design allows for efficient querying and analysis of the data, as all relevant data for a song play is contained in one row of the songplays table and the related data can be easily accessed by joining with the dimension tables.
The ETL pipeline extracts data from the JSON logs and song data files, transforms the data (such as converting timestamps to a more usable format), and loads it into the appropriate tables in the sparkifydb database. The pipeline is designed to be flexible and able to handle new data being added to the input files without disrupting the existing data in the database.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sparkify Data Modeling with Postgres

Introduction

Running the Python Scripts

Repository Contents

Database Schema Design and ETL Pipeline

Entity Relationship Diagram

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
ERD.py		ERD.py
README.md		README.md
create_tables.py		create_tables.py
data.zip		data.zip
etl.ipynb		etl.ipynb
etl.py		etl.py
sparkifydb_erd.png		sparkifydb_erd.png
sql_queries.py		sql_queries.py
test.ipynb		test.ipynb

Marcoc51/Sparkify-Data-Modeling-with-Postgres

Folders and files

Latest commit

History

Repository files navigation

Sparkify Data Modeling with Postgres

Introduction

Running the Python Scripts

Repository Contents

Database Schema Design and ETL Pipeline

Entity Relationship Diagram

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages