This project involves creating a NoSQL database using Apache Cassandra for Sparkify, a startup focusing on music streaming. The aim is to analyze song and user activity data collected on their app, and provide a seamless way to query this data to understand user preferences.
- Python 3.7+
- Apache Cassandra
- Cassandra Python Driver
- Clone this repository.
- Execute `Data_Modeling_with_Cassandra.ipynb` to preprocess the data and interact with the database.
This project entails creating tables in Apache Cassandra to facilitate efficient querying on song play data for Sparkify’s analytics team. The ETL pipeline is developed using Python, and it processes data residing in a directory of CSV files to create a streamlined CSV file, which is then used to insert data into Apache Cassandra tables.
The dataset used is `event_data`, a collection of CSV files partitioned by date. It contains details such as artist name, user name, song details, and user location.
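The combining step described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the function name `combine_event_files`, its parameters, and the optional rule of skipping rows with an empty column are assumptions made for the example.

```python
import csv
import glob
import os


def combine_event_files(input_dir, output_path, columns, required_column=None):
    """Merge every CSV under input_dir into a single CSV with the chosen columns.

    Returns the number of data rows written. All names here are hypothetical.
    """
    # Collect the input files before opening the output file for writing.
    pattern = os.path.join(input_dir, "**", "*.csv")
    file_paths = sorted(glob.glob(pattern, recursive=True))

    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        # extrasaction="ignore" drops any source columns not listed in `columns`.
        writer = csv.DictWriter(out_file, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for path in file_paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                for row in csv.DictReader(in_file):
                    # Assumed rule: optionally skip rows with an empty required column.
                    if required_column and not row.get(required_column):
                        continue
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Writing the combined file outside `input_dir` avoids picking it up as an input on a later run.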
After processing these files, the denormalized data appear as follows:
- Develop an ETL pipeline to process and transform the `event_data` files into a denormalized dataset.
- Create the Apache Cassandra database.
- Model the database tables based on the required queries.
- Create the tables and load the data into them.
- Run the provided queries to verify the model's effectiveness in answering analytics queries.
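As an illustration of modeling a table around a query, the sketch below assumes one hypothetical question: "which artist and song were played at a given position in a given session?" The table name `songs_by_session`, the keyspace name `sparkify`, the CSV column names, and the local host address are all assumptions, not details taken from the notebook. The partition key matches the query's `WHERE` clause, which is the core idea of query-first modeling in Cassandra.

```python
import csv

# Hypothetical table: partitioned by session, clustered by position in the session,
# so the SELECT below can be answered from a single partition.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS songs_by_session (
    session_id int,
    item_in_session int,
    artist text,
    song text,
    PRIMARY KEY (session_id, item_in_session)
)
"""

INSERT_ROW = (
    "INSERT INTO songs_by_session (session_id, item_in_session, artist, song) "
    "VALUES (%s, %s, %s, %s)"
)

SELECT_SONG = (
    "SELECT artist, song FROM songs_by_session "
    "WHERE session_id = %s AND item_in_session = %s"
)


def connect(keyspace="sparkify", host="127.0.0.1"):
    """Open a session on an assumed local single-node cluster."""
    from cassandra.cluster import Cluster  # from the cassandra-driver package

    cluster = Cluster([host])
    session = cluster.connect()
    session.execute(
        "CREATE KEYSPACE IF NOT EXISTS %s WITH REPLICATION = "
        "{'class': 'SimpleStrategy', 'replication_factor': 1}" % keyspace
    )
    session.set_keyspace(keyspace)
    return cluster, session


def load_rows(session, csv_path):
    """Create the table and insert each CSV row; returns the number of rows inserted."""
    session.execute(CREATE_TABLE)
    inserted = 0
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        for row in csv.DictReader(csv_file):
            # Column names below are assumed; adjust to the actual CSV header.
            params = (
                int(row["sessionId"]),
                int(row["itemInSession"]),
                row["artist"],
                row["song"],
            )
            session.execute(INSERT_ROW, params)
            inserted += 1
    return inserted
```

With a running Cassandra node, calling `connect()`, then `load_rows(session, ...)` on the preprocessed CSV, then `session.execute(SELECT_SONG, (session_id, item_in_session))` would exercise the model; `cluster.shutdown()` closes the connection afterwards.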
- `Sparkify-Project-Notebook.ipynb` - Jupyter notebook containing the ETL pipeline, the Apache Cassandra database and table setup, and the test queries.
- `event_datafile_new.csv` - The preprocessed CSV file, generated by combining the `event_data` files.
This project is part of the Data Engineering Nanodegree Program provided by Udacity.