In this project, I will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population.
Throughout this project, I will focus on the following:
- Use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company.
- Apply what was learned to a third dataset with demographics information for targets of a marketing campaign, and use a model to predict which individuals are most likely to become customers of the company.
The data for this project was provided by Arvato and cannot be shared publicly.
https://www.kaggle.com/c/udacity-arvato-identify-customers
- numpy==1.18.3
- pandas==0.23.4
- scikit-learn==0.22.2.post1
- matplotlib==3.0.3
- seaborn==0.9.1
- Arvato Project Workbook.ipynb
The notebook is divided into 3 major segments:
Part 0: Get to Know the Data: In this part I examine the data and perform the necessary preprocessing steps, such as handling missing values, scaling the data, and modifying column names.
Part 1: Customer Segmentation Report: I use PCA and k-means to describe the relationship between the demographics of the company's existing customers and the general population of Germany.
Part 2: Supervised Learning Model: Here I test several classification models and select a final one for prediction. GridSearchCV is used to tune the hyperparameters of the final model.
- Helper.py
This file contains helper functions for the analysis above, including data preprocessing, plotting, and grid search implementations.
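The segmentation step in Part 1 can be illustrated with a minimal sketch. It is not the notebook's code: the data is synthetic (the Arvato data cannot be shared), and the number of components, the number of clusters, the median imputation, and the `cluster_shares` helper are all assumptions chosen for illustration. The idea is to fit the imputation, scaling, PCA, and k-means steps on the general population, assign customers to the same clusters, and compare cluster shares.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real (non-public) demographics tables.
rng = np.random.RandomState(0)
population = rng.normal(size=(500, 10))
customers = rng.normal(loc=0.5, size=(200, 10))
# Simulate roughly 5% missing values in the population data.
population[rng.rand(*population.shape) < 0.05] = np.nan

n_clusters = 4  # hypothetical value, not taken from the project

# Impute, scale, reduce dimensionality, then cluster.
segmenter = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PCA(n_components=5, random_state=0),
    KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
)

# Fit on the general population, then assign customers to the same clusters.
population_labels = segmenter.fit_predict(population)
customer_labels = segmenter.predict(customers)


def cluster_shares(labels, n_clusters):
    """Return the fraction of rows assigned to each cluster."""
    counts = np.bincount(labels, minlength=n_clusters)
    return counts / counts.sum()


population_share = cluster_shares(population_labels, n_clusters)
customer_share = cluster_shares(customer_labels, n_clusters)

for k in range(n_clusters):
    print(f"cluster {k}: population {population_share[k]:.2%}, "
          f"customers {customer_share[k]:.2%}")
```

Clusters where the customer share is clearly higher than the population share point to the parts of the population that are over-represented in the customer base.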
After I trained multiple machine learning models and compared their results, the CatBoost classifier performed best, with a ROC AUC score of 0.80028.
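As a rough sketch of how hyperparameter tuning with GridSearchCV and ROC AUC scoring might look, the example below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for CatBoost, synthetic imbalanced data, and a small parameter grid. All of these are assumptions for illustration; the actual grid and model settings are in the notebook and `Helper.py`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced binary data standing in for the campaign dataset.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Hypothetical grid; the values are illustrative only.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for CatBoost
    param_grid,
    scoring="roc_auc",  # rank candidates by cross-validated ROC AUC
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated ROC AUC:", round(search.best_score_, 3))
```

Stratified folds keep the share of positive examples similar in each split, which matters when the positive class is rare.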
For a detailed analysis of the results, see the Medium post:
https://medium.com/@malhotra.vaibhav0304/effectively-target-customers-use-data-for-customer-segmentation-fb6425b593fd