nadyavoynich/DataEngineering-ND-DataModeling-Cassandra

Designing a Cassandra Database for Sparkify's Music Streaming Analytics

Description

This project creates a NoSQL database in Apache Cassandra for Sparkify, a music streaming startup. The aim is to analyze the song and user activity data collected by its app and to make that data easy to query for insight into user preferences.

Installation

  • Python 3.7+
  • Apache Cassandra
  • Cassandra Python Driver

Usage

  1. Clone this repository.
  2. Execute Data_Modeling_with_Cassandra.ipynb to preprocess the data and interact with the database.

Project Overview

This project entails creating tables in Apache Cassandra to facilitate efficient querying on song play data for Sparkify’s analytics team. The ETL pipeline is developed using Python, and it processes data residing in a directory of CSV files to create a streamlined CSV file, which is then used to insert data into Apache Cassandra tables.
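
The CSV-merging step described above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the function name `combine_event_files`, the `required` parameter, and any column names passed in are assumptions made for the example.

```python
import csv
import glob
import os


def combine_event_files(input_dir, output_path, columns, required="artist"):
    """Merge every CSV under input_dir into one CSV holding only `columns`.

    Rows whose `required` column is empty are skipped. The column names are
    hypothetical; adjust them to match the real event_data headers.
    Returns the number of data rows written.
    """
    pattern = os.path.join(input_dir, "**", "*.csv")
    paths = sorted(glob.glob(pattern, recursive=True))
    out_abs = os.path.abspath(output_path)
    rows_written = 0

    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=columns)
        writer.writeheader()
        for path in paths:
            # Never read the output file back in if it lives under input_dir.
            if os.path.abspath(path) == out_abs:
                continue
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.DictReader(src):
                    if not row.get(required):
                        continue
                    writer.writerow({c: row.get(c, "") for c in columns})
                    rows_written += 1
    return rows_written
```

Calling `combine_event_files("event_data", "event_datafile_new.csv", ["artist", "song"])` would then produce a single file with one header row, assuming those two columns exist in the source files.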

Datasets

The dataset used is event_data, a collection of CSV files partitioned by date. It contains details such as artist name, user name, song details, and user location. After processing these files, the denormalized data appear as follows:

[Image: Sample of the denormalized data]

Project Steps

  1. Develop an ETL pipeline to process and transform event_data files to create a denormalized dataset.
  2. Create the Apache Cassandra database.
  3. Model the database tables based on the required queries.
  4. Create the tables and load the data into them.
  5. Run the provided queries to verify the model's effectiveness in answering analytics queries.
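
Steps 2 to 5 above might look roughly like the sketch below, which uses the Cassandra Python driver (`cassandra-driver`). It is an illustration under stated assumptions, not the notebook's actual schema: the keyspace name `sparkify`, the table `songs_by_session`, its columns, and the CSV keys (`sessionId`, `itemInSession`, and so on) are all hypothetical. The point it shows is that the table is designed around one query, so the columns in the `WHERE` clause form the primary key.

```python
KEYSPACE = "sparkify"  # hypothetical keyspace name

CREATE_KEYSPACE = (
    "CREATE KEYSPACE IF NOT EXISTS sparkify "
    "WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)

# Hypothetical table for the query "which song was played at a given
# position in a given session?". session_id is the partition key and
# item_in_session is the clustering column.
CREATE_TABLE = (
    "CREATE TABLE IF NOT EXISTS songs_by_session ("
    "session_id int, item_in_session int, artist text, song text, length float, "
    "PRIMARY KEY ((session_id), item_in_session))"
)

INSERT_ROW = (
    "INSERT INTO songs_by_session "
    "(session_id, item_in_session, artist, song, length) "
    "VALUES (%s, %s, %s, %s, %s)"
)

SELECT_ROW = (
    "SELECT artist, song, length FROM songs_by_session "
    "WHERE session_id = %s AND item_in_session = %s"
)


def connect(host="127.0.0.1"):
    """Open a session on a local Cassandra node (assumes one is running)."""
    # Imported here so the rest of the sketch can be read and tested
    # without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster([host])
    return cluster, cluster.connect()


def setup_and_load(session, rows):
    """Create the keyspace and table, then insert rows (dicts from the CSV)."""
    session.execute(CREATE_KEYSPACE)
    session.set_keyspace(KEYSPACE)
    session.execute(CREATE_TABLE)
    for row in rows:
        session.execute(
            INSERT_ROW,
            (
                int(row["sessionId"]),
                int(row["itemInSession"]),
                row["artist"],
                row["song"],
                float(row["length"]),
            ),
        )


def song_in_session(session, session_id, item_in_session):
    """Run the query the table was modeled for and return the result rows."""
    return list(session.execute(SELECT_ROW, (session_id, item_in_session)))
```

Because `session_id` is the partition key, the lookup reads a single partition rather than scanning the table. A different analytics question would normally get its own table with a different primary key.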

Files

  • Sparkify-Project-Notebook.ipynb - Jupyter notebook containing the ETL pipeline, the Apache Cassandra database and table setup, and the test queries.
  • event_datafile_new.csv - The preprocessed CSV file, generated by combining event_data files.

Acknowledgments

This project is part of the Data Engineering Nanodegree Program provided by Udacity.

About

Modelling a NoSQL database for a music streaming app's analytics.
