- Introduction
- About Dataset
- Exploratory Data Analysis
- Data Preprocessing
- Preparation for Training Model
- Building Model
- Model Evaluation
- Conclusion
Customer churn prediction means identifying which customers are likely to leave or unsubscribe from a service. It is an important task for many businesses, including banks.
Why is it important?
- Acquiring new customers costs more than retaining existing ones.
- Churn reduces revenue and profit.
- Churn reduces the opportunity to grow the business.
- ...
In this project, I come up with several hypotheses and then build a classification model to predict customer churn.
The dataset contains 10,000 rows and 18 columns with information for predicting customer churn.
- Personal Information (Name, Age, Geography, Gender, Estimated Salary)
- Status (Credit Score, Has Credit Card, Tenure, Balance, Number of Products)
- Satisfaction (Complain, Satisfaction Score)
Name of the dataset: Bank Customer Churn
Format: .csv
Source: Kaggle
Link: https://www.kaggle.com/datasets/radheshyamkollipara/bank-customer-churn
Overall, about 20% of all customers left the bank.
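The overall churn rate is simply the mean of the 0/1 `Exited` column. A minimal sketch on a made-up sample (only the `Exited` column name comes from the dataset; the rows are illustrative, not real data):

```python
import pandas as pd

# Tiny made-up sample standing in for the real data.
# 'Exited' is 1 for a customer who left and 0 for one who stayed.
sample = pd.DataFrame({"Exited": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]})

# The mean of a 0/1 column is the share of customers who left.
churn_rate = sample["Exited"].mean()
print(f"Overall churn rate: {churn_rate:.0%}")
```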
Hypothesis 1: Are elderly customers more likely to leave the bank than younger ones because they do not feel safe depositing money there?
- Customers' ages range from 18 to 92 and are separated into 4 groups.
- The chart is consistent with the first hypothesis: the elderly group has the highest churn rate (44%).
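The grouping step can be sketched with `pd.cut` followed by a group-wise mean. The bin edges and labels below are assumptions chosen for illustration (the actual age groups used in the analysis are not listed here), `churn_rate_by_group` is a hypothetical helper name, and the rows are made up:

```python
import pandas as pd


def churn_rate_by_group(df, col, bins, labels, target="Exited"):
    """Bin a numeric column and return the churn rate of each bin."""
    groups = pd.cut(df[col], bins=bins, labels=labels)
    # observed=False keeps empty bins in the result instead of dropping them.
    return df.groupby(groups, observed=False)[target].mean()


# Made-up rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "Age": [22, 25, 35, 40, 50, 55, 65, 70],
    "Exited": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Hypothetical age bands covering the 18-92 range.
rates = churn_rate_by_group(
    sample,
    "Age",
    bins=[17, 30, 45, 60, 92],
    labels=["18-30", "31-45", "46-60", "61-92"],
)
print(rates)
```

The same helper could be applied to other numeric columns such as `Balance` or `Tenure` by passing different bin edges.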
Hypothesis 2: Do customers who live in a different country leave the bank because they find it hard to reach?
- Customers come from 3 different countries in Europe: Spain, France and Germany.
- Germany clearly has the highest churn rate, about 32%, which is twice that of the other two countries.
- For tenure, there are no notable differences between groups.
- However, customers who stayed for 6-8 years have the lowest churn rate (18%).
- The maximum number of products held by a customer is 4.
- The output above shows that most customers buy 2 products, while far fewer customers purchase 3 or 4 products. Even so, the exit rate in the 3-4 product group is higher than in the 1-2 product group.
- Moreover, every customer who bought 4 products left the bank (100%).
- There are 4 card types: SILVER, GOLD, PLATINUM and DIAMOND.
- The chart shows that customers still decide to leave regardless of how high their card tier is.
- The points earned by customers range from 119 to 1000.
- Earned points do not affect customer churn.
Hypothesis 7: Does the churn rate differ between customer groups when Point Earned and Card Type are combined?
- A combination of Point Earned and Card Type shows noticeable gaps between groups.
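Combining two features amounts to computing the churn rate for every pair of groups, which a pivot table does directly. A minimal sketch on made-up rows, assuming a hypothetical two-band split of `Point Earned` at 500 (the bands used in the analysis are not stated here):

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "Card Type": ["SILVER", "SILVER", "GOLD", "GOLD", "SILVER", "GOLD"],
    "Point Earned": [150, 900, 200, 950, 850, 300],
    "Exited": [0, 1, 0, 0, 1, 1],
})

# Hypothetical banding: more than 500 points counts as "high".
sample["Point Group"] = (
    sample["Point Earned"].gt(500).map({True: "high", False: "low"})
)

# Rows are card types, columns are point bands, cells are churn rates.
rates = sample.pivot_table(
    index="Card Type", columns="Point Group", values="Exited", aggfunc="mean"
)
print(rates)
```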
- The lowest satisfaction score is 1 and the highest is 5.
- Churn rates show no differences across satisfaction scores.
Hypothesis 9: Do customers complain because they are unhappy with the bank, or because they want to give feedback in the hope that the bank improves its service?
- The number of customers who give positive feedback (about 8000) is nearly 4 times the number who complain (just over 2000).
- However, the vast majority of customers who complain decide to leave the bank (99%).
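The relationship between complaints and churn can be read from a row-normalised cross-tabulation. A small sketch on made-up rows (the 99% figure above comes from the real data, not from this sample):

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "Complain": [0, 0, 0, 0, 0, 0, 1, 1],
    "Exited":   [0, 0, 0, 0, 0, 1, 1, 1],
})

# normalize="index" turns each row into shares, so the column for
# Exited == 1 is the churn rate within each Complain group.
table = pd.crosstab(sample["Complain"], sample["Exited"], normalize="index")
print(table)
```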
- The output above shows no difference in churn rate between customers with and without a credit card.
- Balance ranges from 0 to 250080 and is divided into 4 groups.
- However, the 0-73K group contains twice as many customers as each of the other groups.
- The data shows that customers with lower balances tend to stay longer than those with higher balances.
- About 26% of customers with a balance of about 110-134K leave the bank.
- Credit scores do not appear to affect customers' decision to leave.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler


def replace_cardtype(df, col):
    '''
    Replace the ordered card-type labels with ordinal integers.
    Parameters:
        df: data frame
        col: the column whose values are replaced
    '''
    df[col] = df[col].replace({'SILVER': 0, 'GOLD': 1, 'PLATINUM': 2, 'DIAMOND': 3})
    return df


def encode(df, col):
    '''
    One-hot encode a categorical column.
    Parameters:
        df: data frame
        col: the column to encode
    '''
    # Pass the column by keyword: the second positional argument of
    # pd.get_dummies is `prefix`, not `columns`.
    df = pd.get_dummies(df, columns=[col])
    return df


def scale_data(df, list_of_cols):
    '''
    Standardize the given columns (zero mean, unit variance).
    Parameters:
        df: data frame
        list_of_cols: a list of columns that need scaling
    '''
    scaler = StandardScaler()
    df[list_of_cols] = scaler.fit_transform(df[list_of_cols])
    return df


def imbalance_process(df, label):
    '''
    Oversample the minority class with SMOTE.
    Parameters:
        df: data frame
        label: name of the target column
    '''
    X = df.drop(label, axis=1)
    y = df[label]
    smote = SMOTE(random_state=42)
    X_resample, y_resample = smote.fit_resample(X, y)
    return X_resample, y_resample
Create a new data frame for modeling that contains the relevant features from the hypotheses above:
df_model = df.loc[:,['Age','Geography', 'Tenure', 'NumOfProducts', 'Card Type', 'Point Earned', 'Complain', 'Balance', 'Exited']]
df_model = replace_cardtype(df_model, 'Card Type')
df_model = encode(df_model, 'Geography')
cols_to_scale = ['Age', 'Tenure', 'NumOfProducts', 'Complain', 'Point Earned', 'Balance']
df_model = scale_data(df_model, cols_to_scale)
X_resample, y_resample = imbalance_process(df_model, 'Exited')
- y variable: 'Exited'
- X variables: all other features in df_model.
X_train, X_test, y_train, y_test = split_data(df_model, 'Exited')
model = LogisticRegression(random_state=43)
model.fit(X_train, y_train)
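`split_data` is called above, but its definition is not included in this write-up. A minimal sketch of what such a helper might look like, assuming a stratified 80/20 split with scikit-learn's `train_test_split` (the `test_size`, `random_state` and `stratify` settings are assumptions, not the project's confirmed choices, and the rows are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, label, test_size=0.2, random_state=42):
    """Hypothetical helper: split features and label into train and test sets."""
    X = df.drop(label, axis=1)
    y = df[label]
    # stratify=y keeps the churn ratio similar in both subsets.
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )


# Made-up rows, only to show the call.
toy = pd.DataFrame({
    "Age": [25, 40, 33, 61, 52, 47, 38, 29, 70, 45],
    "Exited": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
X_train, X_test, y_train, y_test = split_data(toy, "Exited")
print(X_train.shape, X_test.shape)
```

If oversampling is used, applying it to the training portion only would keep synthetic rows out of the test set; whether the project does this is not stated here.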
Metrics used to evaluate the model:
- Accuracy
- Confusion Matrix
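Both metrics are available in `sklearn.metrics`. A small sketch with made-up labels and predictions (these are illustrative values, not results from this project):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up ground truth and predictions (1 = exited, 0 = stayed).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Accuracy: share of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(acc)
print(cm)
```

With imbalanced classes such as churn, the confusion matrix is the more informative of the two, since accuracy alone can look good even when most churners are missed.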
- Most customers who left appear to have been dissatisfied with the bank, as the "Complain" feature shows a strong impact.
- Most customers who leave the bank are from Germany; perhaps the bank is not located close enough to this country to reach its customers there.
- According to the EDA, elderly customers were likely to leave the bank: about 44.6% of the elderly group exited. Perhaps they found it hard to use bank services such as internet banking, or they felt unsafe investing money in the bank.
- Earned points are the fourth most influential feature. Contrary to the hypothesis, customers who earned high points (801-1000) were more likely to leave the bank than those with fewer points. Perhaps they earned many points but received no benefit from them, so they left.
- Customer balance is also a relatively strong feature affecting the churn rate.
- Card Type shows no differences in the EDA chart, so this feature sits at the bottom of the feature-importance plot.
- The number of purchased products has little overall effect on the exit rate, although customers who bought 4 products left the bank, whereas customers who bought 2 products tended to stay longer.
- As observed in the EDA, the gap between tenure groups is not significant. However, customers who stayed at the bank for 6-8 years were the least likely to leave (18% churn rate).
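The conclusions above refer to a feature-importance plot whose code is not shown. For a logistic regression on features with comparable scales, one common way to rank features is by the absolute value of the fitted coefficients. A minimal sketch on made-up data, assuming this is roughly how such a ranking could be obtained (the project's actual method is not confirmed here; `Noise` is a hypothetical column added for contrast):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up data: "Complain" tracks the label, "Noise" does not.
X = pd.DataFrame({
    "Complain": [0, 0, 0, 0, 1, 1, 1, 1],
    "Noise": [0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1, -0.1],
})
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression(random_state=43).fit(X, y)

# One coefficient per feature; a larger magnitude means a stronger effect
# on the predicted log-odds, provided the features share a comparable scale.
importance = pd.Series(np.abs(model.coef_[0]), index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```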