Designing a Cassandra Database for Sparkify's Music Streaming Analytics

Description

This project involves creating a NoSQL database using Apache Cassandra for Sparkify, a startup focusing on music streaming. The aim is to analyze song and user activity data collected on their app, and provide a seamless way to query this data to understand user preferences.

Installation

The project requires:
Python 3.7+
Apache Cassandra
Cassandra Python driver (the cassandra-driver package)

Usage

Clone this repository.
Run Sparkify-Project-Notebook.ipynb to preprocess the data and interact with the database.

Project Overview

This project entails creating tables in Apache Cassandra to facilitate efficient querying on song play data for Sparkify’s analytics team. The ETL pipeline is developed using Python, and it processes data residing in a directory of CSV files to create a streamlined CSV file, which is then used to insert data into Apache Cassandra tables.
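The preprocessing step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name combine_event_files is hypothetical, and it assumes every source file shares the same header row and that the output file is written outside the source directory.

```python
import csv
import glob
import os


def combine_event_files(source_dir, output_path):
    """Merge every CSV under source_dir into one CSV at output_path.

    Hypothetical helper: assumes all files share one header row.
    Returns the number of data rows written (header excluded).
    """
    # Sort the paths so the output order is deterministic.
    pattern = os.path.join(source_dir, "**", "*.csv")
    paths = sorted(glob.glob(pattern, recursive=True))

    header = None
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip completely empty files
                if header is None:
                    # Write the header once, taken from the first file.
                    header = file_header
                    writer.writerow(header)
                for row in reader:
                    # Drop blank lines and rows whose cells are all empty.
                    if not row or all(cell == "" for cell in row):
                        continue
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Under those assumptions, calling combine_event_files("event_data", "event_datafile_new.csv") from the repository root would produce the streamlined file used for the inserts.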

Datasets

The dataset is event_data, a collection of CSV files partitioned by date. Each file contains details such as artist name, user name, song details, and user location. Processing these files produces a single denormalized file, event_datafile_new.csv.

Project Steps

Develop an ETL pipeline to process and transform event_data files to create a denormalized dataset.
Create the Apache Cassandra database.
Model the database tables based on the required queries.
Create the tables and load the data into them.
Run the provided queries to verify the model's effectiveness in answering analytics queries.
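The modeling and loading steps above might look something like the sketch below. Everything named here is an assumption for illustration: the sparkify keyspace, the plays_by_session table, its columns, and the query it serves are hypothetical and not taken from the notebook. The point it shows is the query-first approach: the lookup filters on session_id and item_in_session, so those columns form the primary key. The functions accept any object with the execute and set_keyspace methods that a cassandra-driver Session provides.

```python
# Hypothetical keyspace for a single-node development cluster.
CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS sparkify
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
"""

# Hypothetical table built for one query: "which song was played at a
# given position in a given session?" The WHERE columns of that query
# become the partition key (session_id) and clustering column.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS plays_by_session (
    session_id int,
    item_in_session int,
    artist text,
    song text,
    PRIMARY KEY (session_id, item_in_session)
)
"""

INSERT_ROW = """
INSERT INTO plays_by_session (session_id, item_in_session, artist, song)
VALUES (%s, %s, %s, %s)
"""

SELECT_BY_SESSION = """
SELECT artist, song FROM plays_by_session
WHERE session_id = %s AND item_in_session = %s
"""


def setup_schema(session):
    """Create the keyspace and table, then switch to the keyspace."""
    session.execute(CREATE_KEYSPACE)
    session.set_keyspace("sparkify")
    session.execute(CREATE_TABLE)


def load_rows(session, rows):
    """Insert (session_id, item_in_session, artist, song) tuples.

    CSV values arrive as strings, so the integer columns are cast first.
    Returns the number of rows inserted.
    """
    count = 0
    for session_id, item_in_session, artist, song in rows:
        params = (int(session_id), int(item_in_session), artist, song)
        session.execute(INSERT_ROW, params)
        count += 1
    return count


def plays_for(session, session_id, item_in_session):
    """Run the verification query and return the matching rows as a list."""
    return list(session.execute(SELECT_BY_SESSION, (session_id, item_in_session)))
```

With a local Cassandra node running, a session could be obtained with Cluster(["127.0.0.1"]).connect() from cassandra.cluster and passed to setup_schema, load_rows, and plays_for in that order. Other queries would get their own tables with different primary keys rather than reusing this one.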

Files

Sparkify-Project-Notebook.ipynb - Jupyter notebook containing the ETL pipeline, the Apache Cassandra keyspace and table setup, and the test queries.
event_datafile_new.csv - The preprocessed CSV file, generated by combining event_data files.

Acknowledgments

This project is part of the Data Engineering Nanodegree Program provided by Udacity.

Repository: nadyavoynich/DataEngineering-ND-DataModeling-Cassandra
