In this project, we predict churn for a fictional streaming service called Sparkify.
The complete analysis and discussion are available here.
- Python 3
- The complete list of requirements can be found in requirements.txt
In this project, we try to predict whether a user will churn (cancel the service) given information about their interactions with the service.
In the future, we could also try to predict whether a user will downgrade their subscription (become a free user).
I've applied four classification algorithms and several techniques to work with the data.
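As a rough sketch of how one of these classifiers can be wired up in PySpark MLlib (an illustration only, not the exact pipeline from the notebook; the feature columns and the `churn` label column are assumptions):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StandardScaler, VectorAssembler

# Hypothetical numeric columns produced by the feature engineering step.
feature_cols = ["num_sessions", "songs_per_session", "thumbs_down_rate"]

# Assemble the columns into one vector, scale it, and fit a classifier
# on a binary `churn` label.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(labelCol="churn", featuresCol="features")

pipeline = Pipeline(stages=[assembler, scaler, lr])
# model = pipeline.fit(train_df)  # train_df: DataFrame with the engineered features
```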
For easy viewing, I've hosted HTML versions of the notebooks online, so you can see a static version without opening the .ipynb files on GitHub (which is usually slow and hangs for large files).
- The notebook with the training process;
- The results analysis, covering all the algorithms and metrics;
- The data exploration (with the raw data and the final version with the new engineered features).
For a more complete analysis, I recommend checking my article here.
```
.
├── results/                 # Static versions of the notebooks (also hosted online; see the project overview)
├── pyspark.sh               # Script to configure and run PySpark in local mode
├── visualizations.py        # Plotly implementations of some visualizations, e.g., a more interactive heatmap (see the data exploration notebook)
├── jupyter_utils.py         # Script to configure pandas for a consistent view across all the notebooks
├── Data Exploration.ipynb   # Exploration of the raw data and of the engineered features
├── Sparkify.ipynb           # Exploration, feature engineering, and training
├── Results - Spark.ipynb    # All the visualizations of the training results and the GridSearch
└── requirements.txt         # The project dependencies
```
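For reference, jupyter_utils.py boils down to a handful of pandas display options; a minimal sketch of that idea (the exact settings in the repo may differ):

```python
import pandas as pd

# Widen the default pandas display so DataFrames render
# the same way across all the notebooks.
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 50)
pd.set_option("display.max_colwidth", 120)
```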
The dataset was provided by Udacity. I've hosted it on my S3 bucket to make it easier to download and to work with across my environments.
- The full dataset is available here (12 GB).
- I've created a smaller version without some columns (firstName, lastName, location, userAgent), but with all the events, available here (2 GB).
Raw Dataset features
- `ts`: Event timestamp in milliseconds
- `gender`: M or F
- `firstName`: First name of the user
- `lastName`: Last name of the user
- `length`: Length of the song
- `level`: Level of subscription, `free` or `paid`
- `registration`: User registration timestamp
- `userId`: User ID at the service
- `auth`: Whether the user is logged in
- `page`: Action of the event (next song, thumbs up, thumbs down)
- `sessionId`: ID of the session
- `location`: Location of the event
- `userAgent`: Browser/web agent of the event
- `song`: Name of the song
- `artist`: Name of the artist
- `method`: HTTP method of the event
- `status`: HTTP status of the request (200, 404)
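Since churn is defined as canceling the service, a common way to build the label from these columns is to flag users whose `page` history contains a cancellation event. A hedged sketch (the notebook may define the label differently; the exact page value is an assumption):

```python
from pyspark.sql import functions as F

# Users that ever reached the cancellation page
# (assuming the event is logged as "Cancellation Confirmation").
churned_users = (
    df.filter(F.col("page") == "Cancellation Confirmation")
      .select("userId")
      .distinct()
      .withColumn("churn", F.lit(1))
)

# Join back so every user gets a 0/1 churn label.
labels = (
    df.select("userId").distinct()
      .join(churned_users, on="userId", how="left")
      .fillna(0, subset=["churn"])
)
```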
To run the training code, first run the pyspark.sh file, then just go to the Sparkify notebook.
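If you'd rather not use the shell script, the same local-mode session can be created directly from Python; a minimal sketch (the memory value is an assumption, pyspark.sh may set different options):

```python
from pyspark.sql import SparkSession

# Local mode, using all available cores on this machine.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("Sparkify")
    .config("spark.driver.memory", "8g")  # assumed value; adjust to your machine
    .getOrCreate()
)
```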
The next step is to decide which version of the data fits your needs. There are three variations of the data load: the medium dataset, the entire dataset as JSON, or the entire dataset as CSV (my version of it, as mentioned in the files section).
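In code, that choice is just a matter of which read call you use; a sketch with placeholder file names (the real paths/URLs are in the notebook):

```python
# Pick exactly one, depending on the dataset version you chose.
df = spark.read.json("medium_sparkify_event_data.json")  # medium dataset (placeholder path)
# df = spark.read.json("sparkify_event_data.json")       # full dataset as JSON (placeholder path)
# df = spark.read.csv("sparkify_event_data.csv",         # full dataset as CSV (placeholder path)
#                     header=True, inferSchema=True)
```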
To run locally, I recommend downloading the dataset first, so Spark won't have to download it every time.
If you want to run the visualizations, don't forget to install the requirements, especially the plotly library.
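As a rough idea of what an interactive Plotly heatmap involves (the actual code in visualizations.py may differ; `plot_heatmap` is a hypothetical name for this sketch):

```python
import plotly.graph_objects as go

def plot_heatmap(corr):
    """Render an interactive correlation heatmap; `corr` is a
    pandas DataFrame, e.g. features_df.corr()."""
    fig = go.Figure(
        go.Heatmap(
            z=corr.values,
            x=corr.columns.tolist(),
            y=corr.columns.tolist(),
            colorscale="RdBu",
            zmin=-1,
            zmax=1,
        )
    )
    fig.show()
```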