A churn prediction model built with PySpark's machine learning tools on Amazon EMR, to help a fictitious music streaming service predict customer churn from user click data.

timchansdp/Churn-Prediction-with-PySpark

Customer Churn Prediction for Music Streaming Business with Apache Spark

Table of Contents

  1. Project Motivation
  2. Dependencies
  3. File Structure & Description
  4. Summary of Results
  5. Acknowledgements

1. Project Motivation

Streaming businesses like Spotify, Pandora, or Netflix often provide services on a month-to-month basis, which appeals to customers. They get a high-quality service at a low cost, with the reassurance that they can cancel at any time if they are unsatisfied.

However, this freedom can be a burden for service providers, who often experience high user turnover. Correctly predicting which customers are likely to cancel could help reduce the churn rate, which would be valuable to the business. Acquiring a new customer is widely held to cost much more than retaining an existing one. If at-risk users are accurately identified in advance, service providers can offer them incentives to stay and potentially save millions in revenue.

To help boost the business of Sparkify, a fictitious music streaming app, this project tackles the problem described above with machine learning techniques.

2. Dependencies

The code was developed with Python 3.7.12 and depends on the Python packages listed below:

  • matplotlib == 3.2.2
  • numpy == 1.19.5
  • pandas == 1.1.5
  • pyspark == 3.1.2
  • seaborn == 0.11.2
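The pinned versions above can be compared against the active environment before running the notebooks. The following is a minimal sketch, not part of the repository: `PINNED`, `installed_version`, and `find_mismatches` are illustrative names, and on Python 3.7 it assumes the `importlib_metadata` backport is installed.

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    # Python 3.7 (the version named above) needs the importlib_metadata backport.
    from importlib_metadata import PackageNotFoundError, version

# Pins copied from the dependency list above.
PINNED = {
    "matplotlib": "3.2.2",
    "numpy": "1.19.5",
    "pandas": "1.1.5",
    "pyspark": "3.1.2",
    "seaborn": "0.11.2",
}


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned=PINNED, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to an (expected, found) tuple; found is None when not installed."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling `find_mismatches()` with no arguments returns an empty dictionary when every pinned package is installed at exactly the listed version. The `lookup` parameter exists only so the check can be exercised without touching the real environment.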

3. File Structure & Description

|-- images # contains visualizations generated by Mini-dataset_Submission.ipynb
|-- mini_sparkify_event_data.json # a 128MB subset of the full 12GB dataset
|-- Full_dataset_AWS-EMR.ipynb # model development with the 12GB full dataset & Amazon Elastic MapReduce
|-- Full_dataset_AWS-EMR.html # HTML version of the model development with the 12GB full dataset & Amazon Elastic MapReduce
|-- Mini_dataset_Submission.ipynb # model development with the 128MB mini dataset
|-- README.md
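As a starting point for exploring the mini dataset listed above, the file can be read into a Spark DataFrame. This is a hedged sketch rather than the notebooks' actual code: `load_events` and the application name `sparkify-churn` are hypothetical, and it assumes the file is stored as JSON Lines (one record per line), which is what `spark.read.json` expects by default.

```python
from pathlib import Path

# File name taken from the listing above; adjust the directory to your checkout.
MINI_DATASET = Path("mini_sparkify_event_data.json")


def load_events(path=MINI_DATASET, app_name="sparkify-churn"):
    """Read the event log into a Spark DataFrame, one row per JSON record."""
    # Imported lazily so this module can be imported without starting Spark.
    from pyspark.sql import SparkSession

    # app_name is a hypothetical label; it only affects how the job is displayed.
    spark = SparkSession.builder.appName(app_name).getOrCreate()
    # Assumption: the file is JSON Lines, the default layout for spark.read.json.
    return spark.read.json(str(path))
```

With the repository checked out and pyspark installed, `load_events().printSchema()` would show the columns Spark inferred from the event records.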

4. Summary of Results

The results are presented in a Medium blog post available here.

5. Acknowledgements

Credit goes to Udacity for simulating the data used in this project. The data imitates what a real-world music streaming service provider would generate.
