Streaming businesses such as Spotify, Pandora, and Netflix often offer their services on a month-to-month basis, which makes them more attractive to customers. Customers get high-quality services at a low cost, with the reassurance that they can cancel at any time if they are unsatisfied.
However, this freedom can be a burden for service providers, who often experience high user turnover. Accurately predicting which customers are likely to cancel could help reduce the churn rate, which is valuable to the business: acquiring a new customer often costs much more than retaining an existing one. If such users are identified in advance, service providers can offer them incentives to stay and potentially save millions in revenue.
To help boost the business of Sparkify, a fictitious music streaming app, this project tackles the problem described above, predicting customer churn, with machine learning techniques.
The code was developed with Python 3.7.12 and depends on the Python packages listed below:
- matplotlib == 3.2.2
- numpy == 1.19.5
- pandas == 1.1.5
- pyspark == 3.1.2
- seaborn == 0.11.2
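
Before running the notebooks, it may help to confirm that the local environment matches these pins. The following is a minimal sketch, not part of the project itself: the names `PINNED`, `installed_version`, and `find_mismatches` are hypothetical, and on Python 3.7 it assumes the third-party `importlib_metadata` backport is installed, since `importlib.metadata` only joined the standard library in Python 3.8.

```python
# Pins copied from the dependency list above.
PINNED = {
    "matplotlib": "3.2.2",
    "numpy": "1.19.5",
    "pandas": "1.1.5",
    "pyspark": "3.1.2",
    "seaborn": "0.11.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        from importlib import metadata  # standard library on Python 3.8+
    except ImportError:
        # Assumption: the importlib_metadata backport is available on Python 3.7.
        import importlib_metadata as metadata
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be exercised without touching the real environment; an empty result from `find_mismatches(PINNED)` means every pinned package is installed at the listed version.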
|-- images # contains visualizations generated by Mini_dataset_Submission.ipynb
|-- mini_sparkify_event_data.json # a 128MB subset of the full 12GB dataset
|-- Full_dataset_AWS-EMR.ipynb # model development with the 12GB full dataset & Amazon Elastic MapReduce
|-- Full_dataset_AWS-EMR.html # model development with the 12GB full dataset & Amazon Elastic MapReduce
|-- Mini_dataset_Submission.ipynb # model development with the 128MB mini dataset
|-- README.md
The results are presented in a Medium blog post available here.
The data used in this project was simulated by Udacity to imitate the data generated by a real-world music streaming service.