Sparkify Cassandra ETL Project

Introduction:

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. There is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app. My role is to create an Apache Cassandra database which can create queries on song play data to answer the questions.

Project Overview:

In this project, I would be applying Data Modeling with Apache Cassandra and complete an ETL pipeline using Python. I am provided with part of the ETL pipeline that transfers data from a set of CSV files within a directory to create a streamlined CSV file to model and insert data into Apache Cassandra tables.

Datasets:

For this project, you'll be working with one dataset: event_data. The directory of CSV files partitioned by date. Here are examples of filepaths to two files in the dataset: event_data/2018-11-08-events.csv event_data/2018-11-09-events.csv

Objective

You will learn three most useful concepts from this project

Data modeling with Cassandra (NoSQL)
Database DeNormalisation
Building ETL PipeLine using python(csv),cassandra

Design the ETL

Implement the logic in section Part I of the notebook template to iterate through each event file in event_data to process and create a new CSV file in Python
Make necessary edits to Part II of the notebook template to include Apache Cassandra CREATE and INSERT three statements to load processed records into relevant tables in your data model
Test by running three SELECT statements after running the queries on your database
Finally, drop the tables and shutdown the cluster

The project file structure

We have a small list of files, easy to maintain and understand the Concept:

Project_1B_Project_Template.ipynb: This was template file provided to fill in the details and write the python script

Project_NoSQL.ipynb: This is the final file provided in which all the queries have been written with importing the files, generating a new csv file and loading all csv files into one. All verifying the results whether all tables had been loaded accordingly as per requirement

Event_datafile_new.csv: This is the final combination of all the csv files which are in the folder event_data

Event_Data Folder: Each event file is present separately, so all the files would be combined into one into event_datafile_new.csv

Author

Aditya Dhanraj - Linkedin Profile.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
Project_NoSQL.ipynb		Project_NoSQL.ipynb
README.md		README.md
event_datafile_new.csv		event_datafile_new.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sparkify Cassandra ETL Project

Introduction:

Project Overview:

Datasets:

Objective

Design the ETL

The project file structure

Author

About

Releases

Packages

Languages

aditya-dhanraj/Data-Modelling-with-Apache-Cassandra

Folders and files

Latest commit

History

Repository files navigation

Sparkify Cassandra ETL Project

Introduction:

Project Overview:

Datasets:

Objective

Design the ETL

The project file structure

Author

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages