Final project for the Statistical Methods course of the SDIC/DSAI Master's degrees - UniTs 2023/2024
Name | Surname | Master |
---|---|---|
Sara | - | DSAI |
Giulio | Fantuzzi | DSAI |
Vishal | Nigam | DSAI |
Marco | Tallone | SDIC |
Alessio | Valentinis | DSAI |
- Create Repository
- Add README
- Download dataset
- R Markdown
- R Scripts Folder
- Add Contributors
- Preprocessing script
- Exploratory Analysis
- Feature selection and variables importance
- Logistic Regression
- K-fold CV
- Apply ROSE and check how models improve (if they improve)
- Check computation of AUC, FPR and FNR
- ABSOLUTELY CHANGE THE NAMES OF THE LEARN/PREDICT FUNCTIONS IF WE CREATE A SINGLE .Rmd
- Decide presentation scheduling and timings
- Just something I noticed: assess() can be assigned to an object, while the CV functions cannot
The project's structure is the following.
.
├── datasets # Folder with datasets
│ └── BankChurners.csv
├── plots # Folder with saved R plots
│ └── plot.png
├── GroupB_Final.Rmd # Final R Markdown
├── README.md # This file
└── r_scripts # R scripts
└── bank.R
Predicting the `Attrition_Flag` response variable from the Credit Card customers dataset available on Kaggle.
The dataset can be found in the `datasets/` folder.
Other projects based on this dataset can be found here. An interesting notebook to look at is here.
A manager at the bank is disturbed with more and more customers leaving their credit card services. They would really appreciate if one could predict for them who is gonna get churned so they can proactively go to the customer to provide them better services and turn customers' decisions in the opposite direction.
I got this dataset from the leaps.analyttica website. I have been using this for a while to get datasets and accordingly work on them to produce fruitful results. The site explains how to solve a particular business problem.
Now, this dataset consists of 10,000 customers mentioning their age, salary, marital_status, credit card limit, credit card category, etc. There are nearly 18 features.
We have only 16.07% of customers who have churned. Thus, it's a bit difficult to train our model to predict churning customers.
*~ source: Kaggle.*
Here is a brief summary of what the dataset contains.
Warning
PLEASE IGNORE THE LAST 2 COLUMNS (NAIVE BAYES CLAS…). I SUGGEST DELETING THEM BEFORE DOING ANYTHING
Important
A business manager of a consumer credit card portfolio is facing the problem of customer attrition. They want to analyze the data to find out the reason behind this and leverage the same to predict customers who are likely to drop off.
Variables description:
Legend: 🆎: categorical, 🔢: numerical, 🔀: binary
# | Variable Name | Description | Type |
---|---|---|---|
1 | `CLIENTNUM` | Client number. Unique identifier for the customer holding the account. | 🔢 |
2 | `Attrition_Flag` | Internal event (customer activity) variable - 1 if the account is closed, else 0 | 🔀* |
3 | `Customer_Age` | Demographic variable - customer's age in years | 🔢 |
4 | `Gender` | Demographic variable - M = Male, F = Female | 🔀* |
5 | `Dependent_count` | Demographic variable - number of dependents | 🔢 |
6 | `Education_Level` | Demographic variable - educational qualification of the account holder (e.g. high school, college graduate) | 🆎 |
7 | `Marital_Status` | Demographic variable - Married, Single, Divorced, Unknown | 🆎 |
8 | `Income_Category` | Demographic variable - annual income category of the account holder (< $40K, $40K - $60K, $60K - $80K, $80K - $120K, > $120K) | 🆎 |
9 | `Card_Category` | Product variable - type of card (Blue, Silver, Gold, Platinum) | 🆎 |
10 | `Months_on_book` | Period of relationship with the bank | 🔢 |
11 | `Total_Relationship_Count` | Total number of products held by the customer | 🔢 |
12 | `Months_Inactive_12_mon` | Number of months inactive in the last 12 months | 🔢 |
13 | `Contacts_Count_12_mon` | Number of contacts in the last 12 months | 🔢 |
14 | `Credit_Limit` | Credit limit on the credit card | 🔢 |
15 | `Total_Revolving_Bal` | Total revolving balance on the credit card | 🔢 |
16 | `Avg_Open_To_Buy` | Open-to-buy credit line (average of the last 12 months) | 🔢 |
17 | `Total_Amt_Chng_Q4_Q1` | Change in transaction amount (Q4 over Q1) | 🔢 |
18 | `Total_Trans_Amt` | Total transaction amount (last 12 months) | 🔢 |
19 | `Total_Trans_Ct` | Total transaction count (last 12 months) | 🔢 |
20 | `Total_Ct_Chng_Q4_Q1` | Change in transaction count (Q4 over Q1) | 🔢 |
21 | `Avg_Utilization_Ratio` | Average card utilization ratio | 🔢 |
22 | `Naive_Bayes_Cla..._1` | Naive Bayes classifier output - to be removed (see warning above) | |
23 | `Naive_Bayes_Cla..._2` | Naive Bayes classifier output - to be removed (see warning above) | |
* after conversion
To import and use the dataset in an R script or R Markdown file, use the following code.
# Set working directory to this repository's root (RStudio only)
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
# Load the dataset from the datasets/ folder
bank <- read.csv("datasets/BankChurners.csv", sep = ",")
Note
The preprocessing steps can be found in the `r_scripts/preprocessing.r` file.
As suggested by the Kaggle description of the dataset, we removed the last two columns.
# Remove the last two columns as suggested in the README
bank <- bank[, -c(22, 23)]
Then, we removed the `CLIENTNUM` column, as it is just an identifier.
# Remove the first column as it is just an index
bank <- bank[, -1]
After that, it was necessary to convert the `Attrition_Flag` column to a binary variable:
- `0` if the account is not closed, i.e. for the `Existing Customer` value
- `1` if the account is closed, i.e. for the `Attrited Customer` value
# Convert the Attrition_Flag column to a binary variable
bank$Attrition_Flag <- ifelse(bank$Attrition_Flag == "Attrited Customer", 1, 0)
Accordingly, all categorical variables were converted to factors:
# Convert all categorical variables to factors
bank$Gender <- as.factor(bank$Gender)
bank$Education_Level <- as.factor(bank$Education_Level)
bank$Marital_Status <- as.factor(bank$Marital_Status)
bank$Income_Category <- as.factor(bank$Income_Category)
bank$Card_Category <- as.factor(bank$Card_Category)
Luckily there were no missing values in the dataset, so we could proceed with the analysis.
Note
The logistic regression can be found in the `r_scripts/logistic_regression.r` file.
A logistic regression model has been built to predict the `Attrition_Flag` response variable.
In an initial phase, the most relevant variables were selected to build the model.
The selection criteria were the following:
- Variables with a p-value lower than 0.05 in the `glm()` model summary have been selected
- Variables with low correlation with the response variable have been removed
- Using the `anova()` test, only the most significant variables (including categorical variables) have been selected
- To avoid multicollinearity, only variables with low VIF have been kept
The final model has been built using the following variables:
- `Gender`: the gender of the customer
- `Marital_Status`: the marital status of the customer
- `Income_Category`: the income category of the customer
- `Total_Relationship_Count`: the total number of products held by the customer
- `Months_Inactive_12_mon`: the number of months inactive in the last 12 months
- `Contacts_Count_12_mon`: the number of contacts in the last 12 months
- `Total_Revolving_Bal`: the total revolving balance on the credit card
- `Total_Trans_Amt`: the total transaction amount in the last 12 months
- `Total_Trans_Ct`: the total transaction count in the last 12 months
- `Total_Ct_Chng_Q4_Q1`: the change in transaction count from Q4 to Q1
Additionally, looking at the data distribution, the logarithm of the `Total_Trans_Amt` variable has been used. Also, the `Months_Inactive_12_mon` variable has been converted to a factor due to its peculiar distribution (see plot), with levels `1`, `2`, `3` and `4+` months.
Both of these changes significantly improved the model performance, as shown by the metrics below.
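As a rough sketch, the two transformations could look as follows (the exact grouping rule for the `4+` level is an assumption, consistent with the 4 degrees of freedom of `Months_Inactive_12_mon` in the ANOVA below):

```r
# Sketch of the two transformations (the "4+" grouping rule is an assumption)
bank$Total_Trans_Amt <- log(bank$Total_Trans_Amt)  # log-transform the skewed amounts

m <- bank$Months_Inactive_12_mon
bank$Months_Inactive_12_mon <- factor(ifelse(m >= 4, "4+", as.character(m)))
```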
The result of the ANOVA test is the following:
Analysis of Deviance Table
Model: binomial, link: logit
Response: Attrition_Flag
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 10126 8927.2
Gender 1 14.12 10125 8913.1 0.0001715 ***
Total_Relationship_Count 1 227.81 10124 8685.3 < 2.2e-16 ***
Months_Inactive_12_mon 4 434.03 10120 8251.2 < 2.2e-16 ***
Contacts_Count_12_mon 1 479.80 10119 7771.4 < 2.2e-16 ***
Total_Revolving_Bal 1 608.24 10118 7163.2 < 2.2e-16 ***
Total_Trans_Amt 1 636.63 10117 6526.6 < 2.2e-16 ***
Total_Trans_Ct 1 1724.51 10116 4802.0 < 2.2e-16 ***
Total_Ct_Chng_Q4_Q1 1 439.71 10115 4362.3 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The VIF test gives the following:
GVIF Df GVIF^(1/(2*Df))
Gender 1.024050 1 1.011954
Total_Relationship_Count 1.119040 1 1.057847
Months_Inactive_12_mon 1.027901 4 1.003446
Contacts_Count_12_mon 1.026478 1 1.013152
Total_Revolving_Bal 1.039958 1 1.019783
Total_Trans_Amt 6.658530 1 2.580413
Total_Trans_Ct 6.762163 1 2.600416
Total_Ct_Chng_Q4_Q1 1.101724 1 1.049630
In the corresponding R script, appropriate learning and prediction functions have been defined, as well as methods to compute effectiveness metrics on the whole dataset and to perform a k-fold cross-validation (a minimal sketch is given below, after the metric list).
The effectiveness metrics used so far are the following:
- Accuracy
- AUC
- FPR (False Positive Rate)
- FNR (False Negative Rate)
- Confusion matrix (only in the whole dataset case)
- AIC
- BIC
The Dummy classifier has been taken as a baseline for comparison.
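As a minimal sketch (with hypothetical names, not necessarily those of the actual script), the k-fold cross-validation loop could look like this:

```r
# Hypothetical sketch of a k-fold CV loop for the logistic model
library(pROC)  # assumed, for auc()

cv_logistic <- function(data, k = 10) {
  folds <- sample(rep(1:k, length.out = nrow(data)))  # random fold assignment
  acc <- auc_val <- numeric(k)
  for (i in 1:k) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- glm(Attrition_Flag ~ ., family = binomial(link = "logit"), data = train)
    prob  <- predict(fit, newdata = test, type = "response")
    pred  <- ifelse(prob > 0.5, 1, 0)
    acc[i]     <- mean(pred == test$Attrition_Flag)
    auc_val[i] <- as.numeric(auc(test$Attrition_Flag, prob))
  }
  list(accuracy = mean(acc), accuracy_sd = sd(acc),
       auc = mean(auc_val), auc_sd = sd(auc_val))
}
```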
The model fitted on the whole dataset is the following:
Call:
glm(formula = Attrition_Flag ~ ., family = binomial(link = "logit"),
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.9883 -0.3308 -0.1429 -0.0468 3.4870
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.59957 0.66799 -0.898 0.369414
GenderM -0.54425 0.07935 -6.858 6.96e-12 ***
Total_Relationship_Count -0.76274 0.04339 -17.580 < 2e-16 ***
Months_Inactive_12_mon1 -3.71092 0.68026 -5.455 4.89e-08 ***
Months_Inactive_12_mon2 -2.39858 0.67176 -3.571 0.000356 ***
Months_Inactive_12_mon3 -1.94577 0.67049 -2.902 0.003708 **
Months_Inactive_12_mon4+ -1.59904 0.68009 -2.351 0.018711 *
Contacts_Count_12_mon 0.52866 0.04167 12.685 < 2e-16 ***
Total_Revolving_Bal -0.74802 0.03870 -19.327 < 2e-16 ***
Total_Trans_Amt 2.61316 0.10463 24.975 < 2e-16 ***
Total_Trans_Ct -4.08983 0.12788 -31.982 < 2e-16 ***
Total_Ct_Chng_Q4_Q1 -0.80523 0.04547 -17.707 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8927.2 on 10126 degrees of freedom
Residual deviance: 4362.3 on 10115 degrees of freedom
AIC: 4386.3
Number of Fisher Scoring iterations: 7
The results obtained on the same training dataset are the following:
----------------------------------------
Predicted
Actual Existing Attrited
Existing 8238 262
Attrited 576 1051
----------------------------------------
Accuracy: 91.73 %
Dummy classifier accuracy: 83.93 %
----------------------------------------
AUC: 93.67 %
Random classifier AUC: 50 %
----------------------------------------
FPR: 3.08 %
FNR: 35.4 %
----------------------------------------
AIC: 4386.343
BIC: 4473.018
----------------------------------------
Note that the accuracy of the Dummy classifier should be taken into account due to the relatively high class imbalance of the dataset (the dummy accuracy equals the majority-class proportion, 8500/10127 ≈ 83.93%). The results obtained using a 10-fold cross-validation are the following:
----------------------------------------
Average accuracy: 91.66 +/- 1.22 %
----------------------------------------
Average AUC: 93.66 +/- 1.34 %
----------------------------------------
Average FPR: 3.02 +/- 0.75 %
Average FNR: 44.45 +/- 11.22 %
----------------------------------------
Average AIC: 3948.844 +/- 49.15868
Average BIC: 4034.255 +/- 49.15871
----------------------------------------
Metric | Value | Standard Deviation | Assessment Technique |
---|---|---|---|
Accuracy | 91.73% | - | Whole dataset |
AUC | 93.67 % | - | Whole dataset |
Dummy accuracy | 83.93 % | - | Whole dataset |
Random AUC | 50 % | - | Whole dataset |
FPR | 3.08 % | - | Whole dataset |
FNR | 35.4 % | - | Whole dataset |
AIC | 4386.343 | - | Whole dataset |
BIC | 4473.018 | - | Whole dataset |
Accuracy | 91.66 % | 1.22 % | 10-fold CV |
AUC | 93.66 % | 0.34 % | 10-fold CV |
FPR | 3.02 % | 0.75 % | 10-fold CV |
FNR | 44.45 % | 11.22 % | 10-fold CV |
AIC | 3948.844 | 49.15868 | 10-fold CV |
BIC | 4034.255 | 49.15871 | 10-fold CV |
Note
The penalized regression models can be found in the `r_scripts/penalized_regression.r` file.
I decided to write a single script with general functions that let the user choose between RIDGE and LASSO.
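A minimal sketch of such a unified function with glmnet (alpha = 0 gives ridge, alpha = 1 gives lasso; the names here are illustrative, not necessarily the script's):

```r
# Sketch of a unified ridge/lasso fit with glmnet (illustrative names)
library(glmnet)  # assumed

fit_penalized <- function(data, type = c("ridge", "lasso")) {
  type  <- match.arg(type)
  alpha <- if (type == "ridge") 0 else 1
  x <- model.matrix(Attrition_Flag ~ ., data = data)[, -1]  # drop intercept column
  y <- data$Attrition_Flag
  cv.glmnet(x, y, family = "binomial", alpha = alpha)  # lambda chosen by internal CV
}

# Usage (probabilities at the CV-optimal lambda):
# fit  <- fit_penalized(bank, "lasso")
# prob <- predict(fit, newx = x_new, s = "lambda.min", type = "response")
```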
Regarding RIDGE, the results from a 10-fold CV were:
----------------------------------------
Average accuracy: 90.31 +/- 0.56 %
----------------------------------------
Average AUC: 91.53 +/- 1.36 %
----------------------------------------
Average FPR: 1.61 +/- 0.48 %
Average FNR: 88.74 +/- 8.03 %
----------------------------------------
Regarding LASSO, the results from a 10-fold CV were:
----------------------------------------
Average accuracy: 91.42 +/- 0.8 %
----------------------------------------
Average AUC: 93.43 +/- 0.75 %
----------------------------------------
Average FPR: 3.12 +/- 0.79 %
Average FNR: 45.89 +/- 7.79 %
----------------------------------------
NOTES: as we can see, the FNR is extremely high 😓. The good news is that with ROSE it improves a lot (see here).
I used the gam() function available in the mgcv package.
Note
The model with splines can be found in the `r_scripts/splines.r` file.
Regarding preprocessing, I reused the code available in preprocessing.r, logistic_regression.r and testing_ROSE.R. At first I decided not to delete columns 3, 5, 6, 9, 10, 14, 16, 17, 21, and I implemented 2 different models to compare them and see whether those variables would be helpful in a GAM approach.
As we can see from the summary of gamfit_first_try:
Family: gaussian
Link function: identity
Formula:
Attrition_Flag ~ s(Customer_Age) + Gender + Dependent_count +
Education_Level + Marital_Status + Income_Category + Card_Category +
s(Months_on_book) + Total_Relationship_Count + Months_Inactive_12_mon +
Contacts_Count_12_mon + s(Credit_Limit) + s(Total_Revolving_Bal) +
s(Avg_Open_To_Buy) + s(Total_Amt_Chng_Q4_Q1) + s(Total_Trans_Amt) +
s(Total_Trans_Ct) + s(Total_Ct_Chng_Q4_Q1) + s(Avg_Utilization_Ratio)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.368078 0.045587 8.074 7.57e-16 ***
GenderM -0.031407 0.005340 -5.882 4.19e-09 ***
Dependent_count 0.003172 0.002103 1.508 0.13158
Education_LevelDoctorate 0.018007 0.013043 1.381 0.16743
Education_LevelGraduate 0.001397 0.008311 0.168 0.86651
Education_LevelHigh School 0.004338 0.008855 0.490 0.62425
Education_LevelPost-Graduate 0.017628 0.012440 1.417 0.15652
Education_LevelUneducated 0.002446 0.009372 0.261 0.79410
Education_LevelUnknown 0.009723 0.009333 1.042 0.29752
Marital_StatusMarried -0.034573 0.004725 -7.317 2.72e-13 ***
Income_CategoryLess than 120K -0.016113 0.009762 -1.651 0.09885 .
Card_CategoryGold 0.041499 0.022734 1.825 0.06796 .
Card_CategoryPlatinum 0.049960 0.052124 0.958 0.33784
Card_CategorySilver 0.003453 0.011712 0.295 0.76815
Total_Relationship_Count -0.020870 0.001654 -12.615 < 2e-16 ***
Months_Inactive_12_mon1 -0.210735 0.043096 -4.890 1.02e-06 ***
Months_Inactive_12_mon2 -0.160449 0.042975 -3.734 0.00019 ***
Months_Inactive_12_mon3 -0.132801 0.042942 -3.093 0.00199 **
Months_Inactive_12_mon4+ -0.118787 0.043574 -2.726 0.00642 **
Contacts_Count_12_mon 0.025425 0.002132 11.924 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Customer_Age) 8.8777 8.9915 29.833 < 2e-16 ***
s(Months_on_book) 1.0000 1.0000 1.826 0.1766
s(Credit_Limit) 0.5021 0.5021 38.230 1.20e-05 ***
s(Total_Revolving_Bal) 8.8682 8.9905 52.414 < 2e-16 ***
s(Avg_Open_To_Buy) 1.3158 1.7523 12.087 2.81e-05 ***
s(Total_Amt_Chng_Q4_Q1) 8.2570 8.8236 58.890 < 2e-16 ***
s(Total_Trans_Amt) 8.9462 8.9990 407.685 < 2e-16 ***
s(Total_Trans_Ct) 8.4391 8.8853 250.697 < 2e-16 ***
s(Total_Ct_Chng_Q4_Q1) 7.7431 8.5535 40.753 < 2e-16 ***
s(Avg_Utilization_Ratio) 3.9858 4.9592 2.883 0.0131 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Rank: 109/110
R-sq.(adj) = 0.611 Deviance explained = 61.4%
GCV = 0.05286 Scale est. = 0.052453 n = 10127
The edf of `Months_on_book`, `Credit_Limit` and `Avg_Open_To_Buy` are below 2, so I decided to include them in the gamfit model additively, without splines.
In gamfit_less_vars I considered only the variables `Gender`, `Marital_Status`, `Income_Category`, `Total_Relationship_Count`, `Months_Inactive_12_mon` and `Contacts_Count_12_mon`, and as splines: `Total_Revolving_Bal`, `Total_Trans_Amt`, `Total_Trans_Ct` and `Total_Ct_Chng_Q4_Q1` (a sketch is given below).
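A sketch of gamfit_less_vars as just described (using mgcv; the exact formula in r_scripts/splines.r may differ):

```r
# Sketch of gamfit_less_vars (mgcv; assumed to mirror the script)
library(mgcv)

gamfit_less_vars <- gam(
  Attrition_Flag ~ Gender + Marital_Status + Income_Category +
    Total_Relationship_Count + Months_Inactive_12_mon + Contacts_Count_12_mon +
    s(Total_Revolving_Bal) + s(Total_Trans_Amt) +
    s(Total_Trans_Ct) + s(Total_Ct_Chng_Q4_Q1),
  data = bank  # gaussian/identity by default, as in the summary above
)
summary(gamfit_less_vars)
```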
For the assessment, the code is almost the same as for logistic regression. I slightly changed the CV code so that it can run with different learn functions, in order to compare the results of the two models.
For gamfit:
----------------------------------------
Predicted
Actual Existing Attrited
Existing 8378 122
Attrited 371 1256
----------------------------------------
Accuracy: 95.13 %
Dummy classifier accuracy: 83.93 %
----------------------------------------
AUC: 98.2 %
Dummy classifier AUC: 50 %
----------------------------------------
FPR: 1.44 %
FNR: 22.8 %
----------------------------------------
AIC: -1032.97
BIC: -473.1245
----------------------------------------
For gamfit_less_vars:
----------------------------------------
Predicted
Actual Existing Attrited
Existing 8331 169
Attrited 430 1197
----------------------------------------
Accuracy: 94.09 %
Dummy classifier accuracy: 83.93 %
----------------------------------------
AUC: 97.52 %
Dummy classifier AUC: 50 %
----------------------------------------
FPR: 1.99 %
FNR: 26.43 %
----------------------------------------
AIC: -219.6766
BIC: 98.15811
----------------------------------------
For gamfit with 10 CV:
----------------------------------------
Average accuracy: 95.13 +/- 1.24 %
----------------------------------------
Average AUC: 98.22 +/- 0.55 %
----------------------------------------
Average FPR: 1.4 +/- 0.4 %
Average FNR: 27.16 +/- 8.87 %
----------------------------------------
Average AIC: -1032.97 +/- 0
Average BIC: -473.1245 +/- 0
----------------------------------------
For gamfit_less_vars with 10 CV:
----------------------------------------
Average accuracy: 94.09 +/- 0.62 %
----------------------------------------
Average AUC: 97.53 +/- 0.4 %
----------------------------------------
Average FPR: 1.93 +/- 0.43 %
Average FNR: 31.94 +/- 6.58 %
----------------------------------------
Average AIC: -219.6766 +/- 0
Average BIC: 98.15811 +/- 0
----------------------------------------
NB: I noticed that in the preprocessing steps, we modified the Attrition_Flag variable making it binary, but we let it NUMERICAL (forcing models to do regression on it)! I don't know if it was intended, but either case it's worth a check. I already made a change in my file called ensamble.R
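A one-line sketch of a possible fix (an assumption about what ensamble.R actually does):

```r
# Sketch: make the response a factor so that models perform classification
# (assumption: this matches the change made in ensamble.R)
bank$Attrition_Flag <- as.factor(bank$Attrition_Flag)
```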
Note
The ensemble models can be found in the `r_scripts/ensamble.R` file.
I first looked at the AdaBoost method, initially with a static 80%/20% train-test split and then with a 10-fold cross-validation, using the `adabag` package.
Firstly, I tried using the whole dataset (all predictors). I computed all the assessments by hand, as the models I implemented do not come with all the parameters required by the functions in the `assessment_utils.R` script. A sketch of the fit is given below.
----------------------------------------
Predicted
Actual Existing Attrited
Existing 1680 20
Attrited 48 277
----------------------------------------
Accuracy: 96.64 %
Dummy classifier accuracy: 83.95 %
----------------------------------------
AUC: 98.98 %
Dummy classifier AUC: 50 %
----------------------------------------
FPR: 1.18 %
FNR: 14.77 %
----------------------------------------
Variable importance:
Variable Mean_Gini_Decrease
1 Total_Trans_Amt 21.2023085
2 Total_Trans_Ct 16.0602470
3 Total_Amt_Chng_Q4_Q1 12.6190669
4 Total_Ct_Chng_Q4_Q1 6.3711922
5 Total_Revolving_Bal 6.1183427
6 Customer_Age 5.6225678
7 Credit_Limit 5.0469853
8 Total_Relationship_Count 5.0028304
9 Avg_Open_To_Buy 3.7004296
10 Education_Level 3.6121165
11 Contacts_Count_12_mon 3.4454316
12 Months_on_book 2.9867131
13 Months_Inactive_12_mon 2.4091363
14 Avg_Utilization_Ratio 2.1693641
15 Marital_Status 1.2552304
16 Dependent_count 1.2051558
17 Gender 0.7694100
18 Card_Category 0.2062053
19 Income_Category 0.1972667
----------------------------------------
I used the same parameters as before, but with a 10-fold cross-validation.
----------------------------------------
Average accuracy: 97.33 +/- 0.65 %
----------------------------------------
Average AUC: 99.39 +/- 0.29 %
----------------------------------------
Average FPR: 1.12 +/- 0.36 %
Average FNR: 10.76 +/- 2.53 %
----------------------------------------
Average variable importance ranking:
Variable Mean_Gini_Decrease Std_Dev
1 Total_Trans_Amt 21.3956578 1.02211133
2 Total_Trans_Ct 17.1991349 0.51198369
3 Total_Amt_Chng_Q4_Q1 11.3672550 0.41708711
4 Total_Revolving_Bal 7.2367961 0.25506424
5 Total_Ct_Chng_Q4_Q1 6.2949893 0.49071970
6 Customer_Age 5.1833259 0.35377941
7 Total_Relationship_Count 4.7237441 0.24920299
8 Credit_Limit 4.4433490 0.22454304
9 Education_Level 3.9403137 0.18198034
10 Avg_Open_To_Buy 3.7138876 0.21357719
11 Months_on_book 3.4115906 0.15480877
12 Contacts_Count_12_mon 3.1450846 0.20233821
13 Months_Inactive_12_mon 2.7952042 0.29731693
14 Avg_Utilization_Ratio 2.0598053 0.39425455
15 Dependent_count 1.1766892 0.14792473
16 Marital_Status 0.9806874 0.17593701
17 Gender 0.5731588 0.10085429
18 Card_Category 0.2416357 0.11075239
19 Income_Category 0.1176907 0.06116263
----------------------------------------
I used the `randomForest` package and fitted the model with default parameters (a sketch is given below).
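A sketch of the random forest fit (default parameters; reusing the train/test split from the AdaBoost sketch above):

```r
# Sketch of the random forest fit with default parameters (ntree = 500 by default)
library(randomForest)

rf_fit  <- randomForest(Attrition_Flag ~ ., data = train)
rf_pred <- predict(rf_fit, newdata = test)
table(Actual = test$Attrition_Flag, Predicted = rf_pred)  # confusion matrix
importance(rf_fit)   # MeanDecreaseGini per variable
```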
----------------------------------------
Predicted
Actual Existing Attrited
Existing 1680 20
Attrited 64 261
----------------------------------------
Accuracy: 95.85 %
Dummy classifier accuracy: 83.95 %
----------------------------------------
AUC: 98.71 %
Dummy classifier AUC: 50 %
----------------------------------------
FPR: 1.18 %
FNR: 19.69 %
----------------------------------------
Variable importance:
Variable Mean_Gini_Decrease
16 Total_Trans_Amt 399.827927
17 Total_Trans_Ct 384.831802
13 Total_Revolving_Bal 264.123972
18 Total_Ct_Chng_Q4_Q1 230.778923
9 Total_Relationship_Count 142.053837
15 Total_Amt_Chng_Q4_Q1 141.913448
19 Avg_Utilization_Ratio 129.581664
1 Customer_Age 73.329894
12 Credit_Limit 71.973182
14 Avg_Open_To_Buy 67.580828
10 Months_Inactive_12_mon 59.108270
11 Contacts_Count_12_mon 58.958746
8 Months_on_book 52.208794
4 Education_Level 43.188889
3 Dependent_count 27.821580
2 Gender 19.235724
5 Marital_Status 11.682715
7 Card_Category 5.858160
6 Income_Category 3.350073
----------------------------------------
I used the same parameters as before, but with a 10-fold cross-validation.
----------------------------------------
Average accuracy: 99.16 +/- 0.31 %
----------------------------------------
Average AUC: 99.9 +/- 0.07 %
----------------------------------------
Average FPR: 0.21 +/- 0.16 %
Average FNR: 4.28 +/- 1.96 %
----------------------------------------
Average variable importance ranking:
Variable Mean_Gini_Decrease Std_Dev
1 Total_Trans_Amt 400.370327 0.7560910
2 Total_Trans_Ct 377.983812 0.6455279
3 Total_Revolving_Bal 251.801778 0.3734412
4 Total_Ct_Chng_Q4_Q1 236.051413 0.6783163
5 Avg_Utilization_Ratio 141.900558 0.2279646
6 Total_Relationship_Count 140.834560 0.1284268
7 Total_Amt_Chng_Q4_Q1 139.560874 0.1362001
8 Customer_Age 73.525782 0.6375022
9 Credit_Limit 72.771843 2.1550593
10 Avg_Open_To_Buy 68.327976 0.7628449
11 Contacts_Count_12_mon 59.880585 0.9598183
12 Months_Inactive_12_mon 58.126678 0.8253341
13 Months_on_book 52.115183 6.5760340
14 Education_Level 44.475905 0.7811312
15 Dependent_count 28.768839 1.6842069
16 Gender 18.410534 2.0749535
17 Marital_Status 11.079980 4.9998442
18 Card_Category 5.694400 4.8870951
19 Income_Category 3.357699 6.5292501
----------------------------------------
I then removed the variables not considered in the other models, since variable importance alone may not be a reliable indication of which variables can be dropped.
Note
Let me know whether it makes sense to remove the least important variables according to each ensemble method, as the rankings differ (I would remove those with mean Gini decrease < 1 or 2 for boosting and < 10 or 20 for random forest)!
----------------------------------------
Predicted
Actual Existing Attrited
Existing 1673 27
Attrited 55 270
----------------------------------------
Accuracy: 95.95 %
Dummy classifier accuracy: 83.95 %
----------------------------------------
AUC: 98.4 %
Dummy classifier AUC: 50 %
----------------------------------------
FPR: 1.59 %
FNR: 16.92 %
----------------------------------------
Variable importance:
Variable Mean_Gini_Decrease
1 Total_Trans_Ct 32.1290080
2 Total_Trans_Amt 28.0619350
3 Total_Ct_Chng_Q4_Q1 11.6521977
4 Total_Revolving_Bal 11.4968680
5 Total_Relationship_Count 6.9770316
6 Contacts_Count_12_mon 3.9556203
7 Months_Inactive_12_mon 3.6111736
8 Gender 0.8720491
9 Marital_Status 0.8677845
10 Income_Category 0.3763321
----------------------------------------
I used the same parameters as before, but with a 10-fold cross-validation.
----------------------------------------
Average accuracy: 95.49 +/- 0.53 %
----------------------------------------
Average AUC: 98.55 +/- 0.21 %
----------------------------------------
Average FPR: 2.27 +/- 0.56 %
Average FNR: 16.22 +/- 1.99 %
----------------------------------------
Average variable importance ranking:
Variable Mean_Gini_Decrease Std_Dev
1 Total_Trans_Ct 35.2607130 1.51321544
2 Total_Trans_Amt 27.1277284 0.63385074
3 Total_Revolving_Bal 12.3358661 0.66433659
4 Total_Ct_Chng_Q4_Q1 10.2410856 0.60937211
5 Total_Relationship_Count 6.8953816 0.57825762
6 Contacts_Count_12_mon 3.6713650 0.40844293
7 Months_Inactive_12_mon 2.7927028 0.31481083
8 Gender 0.8494183 0.12389372
9 Marital_Status 0.6127634 0.18474584
10 Income_Category 0.2129757 0.08584478
----------------------------------------
I used the `randomForest` package and fitted the model with default parameters.
----------------------------------------
Predicted
Actual Existing Attrited
Existing 1676 24
Attrited 65 260
----------------------------------------
Accuracy: 95.6 %
Dummy classifier accuracy: 83.95 %
----------------------------------------
AUC: 98.14 %
Dummy classifier AUC: 50 %
----------------------------------------
FPR: 1.41 %
FNR: 20 %
----------------------------------------
Variable importance:
Variable Mean_Gini_Decrease
8 Total_Trans_Amt 529.807126
9 Total_Trans_Ct 485.608407
7 Total_Revolving_Bal 393.019449
10 Total_Ct_Chng_Q4_Q1 318.191270
4 Total_Relationship_Count 205.084791
6 Contacts_Count_12_mon 90.515870
5 Months_Inactive_12_mon 85.966817
1 Gender 35.993873
2 Marital_Status 23.708807
3 Income_Category 8.235691
----------------------------------------
I used the same parameters as before, but with a 10-fold cross-validation.
----------------------------------------
Average accuracy: 99.14 +/- 0.32 %
----------------------------------------
Average AUC: 99.79 +/- 0.11 %
----------------------------------------
Average FPR: 0.29 +/- 0.14 %
Average FNR: 3.93 +/- 1.96 %
----------------------------------------
Average variable importance ranking:
Variable Mean_Gini_Decrease Std_Dev
1 Total_Trans_Amt 517.792419 0.5958843
2 Total_Trans_Ct 486.881318 0.2789695
3 Total_Revolving_Bal 393.766226 0.2129571
4 Total_Ct_Chng_Q4_Q1 329.641423 1.8840771
5 Total_Relationship_Count 201.542137 0.9021276
6 Contacts_Count_12_mon 89.244495 0.8485140
7 Months_Inactive_12_mon 85.594244 4.4140412
8 Gender 35.987968 5.0163658
9 Marital_Status 24.198076 5.2970795
10 Income_Category 8.085316 5.1896773
----------------------------------------
NB: my code is still quite verbose and does more computation than required, but it serves as a backup and a validation to check that the "manually computed" metrics are consistent with those given by the libraries. TODO: fix the Dummy AUC (50% doesn't seem right).
I tried the Decision Trees method, via classification trees, to model the `Attrition_Flag` response variable.
Note
The Decision Tree models and the relevant procedures can be found in the `r_scripts/Decision_Trees.r` file.
I started by preprocessing the dataset along the same lines as for the previous models: I converted the target variable `Attrition_Flag` into a binary variable (0 and 1) and converted the relevant categorical variables into factors.
I took the approach of building classification tree models starting with a dummy classifier (as a baseline), followed by a full tree, later optimising it with fewer variables (reduced tree model), then a k-fold tree and finally a hyperparameter-tuned model. First, a data partition was created to separate train and test data (see the sketch below).
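A sketch of the partition, the dummy baseline and the full tree (assuming rpart, which matches the control parameters used later; names are illustrative):

```r
# Sketch: train/test partition, dummy baseline and full classification tree
# (assuming rpart; names are illustrative)
library(rpart)

bank$Attrition_Flag <- as.factor(bank$Attrition_Flag)  # ensure classification
set.seed(123)
idx        <- sample(nrow(bank), floor(0.8 * nrow(bank)))
train_data <- bank[idx, ]
test_data  <- bank[-idx, ]

# Dummy classifier: always predict the majority class
majority   <- names(which.max(table(train_data$Attrition_Flag)))
dummy_pred <- factor(rep(majority, nrow(test_data)),
                     levels = levels(bank$Attrition_Flag))

# Full tree on all 19 predictors
full_tree   <- rpart(Attrition_Flag ~ ., data = train_data, method = "class")
predictions <- predict(full_tree, newdata = test_data, type = "class")
table(predictions)
```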
Dummy classifier tree: predicts the majority class for all instances.
Running predict() on the test partition (considering all other 19 predictors) gives the predictions below, together with the full tree's performance metrics.
table(predictions)
predictions
0 1
1732 293
Method Metric Value
Full_tree Accuracy 93.28395
Full_tree Precision 95.15012
Full_tree Recall 96.94118
Full_tree Specificity 74.15385
Full_tree F1_Score 96.03730
Full_tree AUC_ROC 85.54751
To reduce the number of variables, I ran the glm procedure (logistic regression) and selected variables on the basis of their p-values.
summary(lr1)
Call:
glm(formula = Attrition_Flag ~ ., family = binomial, data = bank)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.723e+00 4.748e-01 14.162 < 2e-16 ***
Customer_Age -6.131e-03 7.711e-03 -0.795 0.426532
GenderM -8.938e-01 1.455e-01 -6.142 8.16e-10 ***
Dependent_count 1.358e-01 2.998e-02 4.530 5.89e-06 ***
Education_LevelDoctorate 3.689e-01 2.081e-01 1.773 0.076218 .
Education_LevelGraduate -5.798e-03 1.396e-01 -0.042 0.966864
Education_LevelHigh School 1.026e-02 1.488e-01 0.069 0.945016
Education_LevelPost-Graduate 3.112e-01 2.050e-01 1.518 0.128952
Education_LevelUneducated 6.955e-02 1.573e-01 0.442 0.658477
Education_LevelUnknown 1.329e-01 1.554e-01 0.855 0.392310
Marital_StatusMarried -4.994e-01 1.544e-01 -3.234 0.001219 **
Marital_StatusSingle 1.081e-01 1.549e-01 0.698 0.485248
Marital_StatusUnknown 4.528e-02 1.962e-01 0.231 0.817467
Income_Category$40K - $60K -9.083e-01 2.026e-01 -4.484 7.33e-06 ***
Income_Category$60K - $80K -6.405e-01 1.791e-01 -3.576 0.000349 ***
Income_Category$80K - $120K -2.983e-01 1.663e-01 -1.794 0.072811 .
Income_CategoryLess than $40K -7.702e-01 2.190e-01 -3.516 0.000438 ***
Income_CategoryUnknown -8.321e-01 2.322e-01 -3.584 0.000338 ***
Card_CategoryGold 1.066e+00 3.521e-01 3.026 0.002475 **
Card_CategoryPlatinum 9.816e-01 6.813e-01 1.441 0.149654
Card_CategorySilver 4.502e-01 1.962e-01 2.294 0.021778 *
Months_on_book -4.685e-03 7.673e-03 -0.611 0.541484
Total_Relationship_Count -4.493e-01 2.750e-02 -16.338 < 2e-16 ***
Months_Inactive_12_mon 5.078e-01 3.793e-02 13.387 < 2e-16 ***
Contacts_Count_12_mon 5.133e-01 3.655e-02 14.044 < 2e-16 ***
Credit_Limit -1.971e-05 6.860e-06 -2.873 0.004064 **
Total_Revolving_Bal -9.321e-04 7.207e-05 -12.934 < 2e-16 ***
Avg_Open_To_Buy NA NA NA NA
Total_Amt_Chng_Q4_Q1 -4.262e-01 1.878e-01 -2.269 0.023253 *
Total_Trans_Amt 4.855e-04 2.295e-05 21.154 < 2e-16 ***
Total_Trans_Ct -1.192e-01 3.731e-03 -31.944 < 2e-16 ***
Total_Ct_Chng_Q4_Q1 -2.798e+00 1.889e-01 -14.813 < 2e-16 ***
Avg_Utilization_Ratio -1.253e-01 2.470e-01 -0.507 0.612020
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8927.2 on 10126 degrees of freedom
Residual deviance: 4710.6 on 10095 degrees of freedom
AIC: 4774.6
Number of Fisher Scoring iterations: 6
The summary also shows NA values, which happens when aliased (perfectly collinear) variables are present. I checked for aliases: the variable Avg_Open_To_Buy is aliased. After removing this variable, the logistic regression was run again; this time we got the summary of the lr2 model and the following significant variables: Gender, Dependent_count, Marital_Status, Income_Category, Card_Category, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Credit_Limit, Total_Revolving_Bal, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1. Multicollinearity was also checked (none was found):
GVIF Df GVIF^(1/(2*Df))
Customer_Age 2.819426 1 1.679115
Gender 3.707551 1 1.925500
Dependent_count 1.056910 1 1.028061
Education_Level 1.036146 6 1.002963
Marital_Status 1.085814 3 1.013816
Income_Category 5.092476 5 1.176774
Card_Category 1.586441 3 1.079951
Months_on_book 2.811208 1 1.676666
Total_Relationship_Count 1.194193 1 1.092791
Months_Inactive_12_mon 1.057074 1 1.028141
Contacts_Count_12_mon 1.043340 1 1.021440
Credit_Limit 2.710246 1 1.646282
Total_Revolving_Bal 2.551374 1 1.597302
Total_Amt_Chng_Q4_Q1 1.158843 1 1.076496
Total_Trans_Amt 4.386664 1 2.094436
Total_Trans_Ct 4.528855 1 2.128111
Total_Ct_Chng_Q4_Q1 1.181094 1 1.086782
Avg_Utilization_Ratio 2.982879 1 1.727101
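A sketch of the alias and multicollinearity checks described above (the vif() function is assumed to come from the car package, which matches the GVIF output):

```r
# Sketch: find the aliased variable, refit without it, then check VIF
library(car)  # assumed, for vif()

alias(lr1)                                   # flags Avg_Open_To_Buy as aliased
lr2 <- update(lr1, . ~ . - Avg_Open_To_Buy)  # refit without the aliased variable
summary(lr2)                                 # significant variables listed above
vif(lr2)                                     # GVIF values as reported above
```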
We ran the predict procedure for the reduced-tree model and got the metrics below on the validation data:
table(predictions_rm)
predictions_rm
0 1
1733 292
Metric Value
Accuracy 93.03704
Precision 94.97980
Recall 96.82353
Specificity 73.23077
F1_Score 95.89281
AUC_ROC 85.02715
At this stage, a comparison with the full tree model:
The reduced model tree maintains high performance similar to the full tree model across various metrics. This suggests that reducing the number of variables did not significantly impact the model's overall predictive power. It could potentially lead to a more interpretable and efficient model.
Further considerations to be checked:
While the metrics are important, we must also consider the specific goals and context of the business problem. Depending on the business requirements, we may prioritize certain metrics over others, or try further iterations such as k-fold.
table(predictions_kfold)
predictions_kfold
0 1
1757 268
Metric Value
Accuracy 90.07407
Precision 92.65794
Recall 95.76471
Specificity 60.30769
F1_Score 94.18571
AUC_ROC 78.03620
The full tree generally performs slightly better than the reduced model tree on most metrics, indicating that the additional variables contribute to improved performance. Both the full tree and the reduced model tree outperform the k-fold tree across most metrics, suggesting that the k-fold tree might have slightly reduced predictive power or stability. We should consider the trade-offs between model complexity, interpretability and performance when deciding between the full tree and the reduced model tree, and evaluate whether the observed differences in metrics are practically significant and aligned with the business goals. To go further, we can continue the iterative process and consider steps such as hyperparameter tuning or exploring alternative algorithms. So, next we try tuning a tree of the reduced model with hyperparameters.
Fine-tuning the model's hyperparameters, such as the complexity parameter (cp), helps achieve a balance between model complexity and performance, and can lead to a more optimized and generalizable model. We defined some control parameters for the hyperparameters, namely minsplit, minbucket = round(5 / 3), maxdepth = 3 and cp = 0.011, and got the metrics below for the tuned model (a sketch precedes the output).
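A sketch of the tuned tree with the control parameters mentioned above (the minsplit value and the predictor set are assumptions, since only the other parameters were reported):

```r
# Sketch of the hyperparameter-tuned tree (minsplit value is an assumption)
library(rpart)

ctrl <- rpart.control(minsplit  = 5,             # assumption: value not reported
                      minbucket = round(5 / 3),
                      maxdepth  = 3,
                      cp        = 0.011)
# Reduced-model predictor set abbreviated to "." for brevity
tuned_tree       <- rpart(Attrition_Flag ~ ., data = train_data,
                          method = "class", control = ctrl)
predictions_tune <- predict(tuned_tree, newdata = test_data, type = "class")
table(predictions_tune)
```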
table(predictions_tune)
predictions_tune
0 1
1736 289
Metric Value
Accuracy 91.90123
Precision 94.23963
Recall 96.23529
Specificity 69.23077
F1_Score 95.22701
AUC_ROC 82.73303
The tuned model shows improvements in precision, recall and F1 score compared to the k-fold tree, indicating a more balanced and accurate model. The specificity of the tuned model is lower than that of the full tree and the reduced model tree; we should consider the implications for identifying non-churned customers in the business context. The decision on whether to choose one of the existing models or explore a different approach, such as a Random Forest, depends on several factors:
We have to assess how well each model aligns with the performance goals of the business problem. If one model consistently outperforms the others on the key metrics, it is a strong candidate.
We also have to consider the interpretability of the models. Decision trees are inherently interpretable, so if interpretability is crucial, the reduced model tree might be preferred. Random Forests, being an ensemble method, provide powerful predictive capabilities but are generally less interpretable.
Random Forests are also computationally more intensive than individual decision trees.
If the goal is to further improve predictive performance and handle complex relationships, we may explore ensemble methods like Random Forests, which can capture more nuanced patterns and reduce overfitting.
In this section we explore how the ROSE package impacts the performance of the models implemented above.
Note
The ROSE package has been tested in r_scripts/testing_ROSE/
There was an evident imbalance among the target variable's classes:
> table(bank$Attrition_Flag)
0 1
8500 1627
- Attrited customers (1): 1627
- Existing customers (0): 8500
- Attrited customers proportion: 16.06596 %
A new (synthetic) dataset was obtained by applying the ROSE package, as follows:
> library(ROSE)
> bank_balanced <- ROSE(Attrition_Flag ~ ., data = bank, seed = 123)$data
> table(bank_balanced$Attrition_Flag)
0 1
5123 5004
- Attrited customers (1): 5004
- Existing customers (0): 5123
- Attrited customers proportion: 49.41246 %
Note
Look at r_scripts/testing_ROSE/ROSE_logistic_regression.R
Single-run result
----------------------------------------
Predicted
Actual Existing Attrited
Existing 4238 885
Attrited 875 4129
----------------------------------------
Accuracy: 82.62 %
Dummy classifier accuracy: 50.59 %
----------------------------------------
AUC: 90.27 %
Dummy classifier AUC: 50 %
----------------------------------------
FPR: 17.28 %
FNR: 17.49 %
----------------------------------------
AIC: 8036.421
BIC: 8137.542
----------------------------------------
10-fold CV result:
----------------------------------------
Average accuracy: 82.57 +/- 1.49 %
----------------------------------------
Average AUC: 90.2 +/- 0.87 %
----------------------------------------
Average FPR: 17.41 +/- 2.87 %
Average FNR: 17.51 +/- 2.19 %
----------------------------------------
Average AIC: 7233.987 +/- 33.24618
Average BIC: 7333.634 +/- 33.24607
----------------------------------------
Note
Look at r_scripts/testing_ROSE/ROSE_penalized_regression.R
Results for RIDGE regression:
----------------------------------------
Average accuracy: 81.43 +/- 1.22 %
----------------------------------------
Average AUC: 89.17 +/- 1.08 %
----------------------------------------
Average FPR: 17.73 +/- 1.99 %
Average FNR: 19.5 +/- 1.43 %
----------------------------------------
Results for LASSO regression:
----------------------------------------
Average accuracy: 81.47 +/- 1.07 %
----------------------------------------
Average AUC: 89.53 +/- 1.2 %
----------------------------------------
Average FPR: 18.42 +/- 1.93 %
Average FNR: 18.72 +/- 1.47 %
----------------------------------------
Note
Look at r_scripts/testing_ROSE/ROSE_splines.R
gamfit with static train/test division
----------------------------------------
Predicted
Actual Existing Attrited
Existing 4307 816
Attrited 694 4310
----------------------------------------
Accuracy: 85.09 %
Dummy classifier accuracy: 50.59 %
----------------------------------------
AUC: 92.94 %
Dummy classifier AUC: 50 %
----------------------------------------
FPR: 15.93 %
FNR: 13.87 %
----------------------------------------
AIC: 6993.906
BIC: 7493.183
----------------------------------------
gamfit_less_vars with static train/test division
----------------------------------------
Predicted
Actual Existing Attrited
Existing 4279 844
Attrited 735 4269
----------------------------------------
Accuracy: 84.41 %
Dummy classifier accuracy: 50.59 %
----------------------------------------
AUC: 92.32 %
Dummy classifier AUC: 50 %
----------------------------------------
FPR: 16.47 %
FNR: 14.69 %
----------------------------------------
AIC: 7223.264
BIC: 7503.034
----------------------------------------
gamfit with CV
----------------------------------------
Average accuracy: 84.68 +/- 1.75 %
----------------------------------------
Average AUC: 92.52 +/- 0.93 %
----------------------------------------
Average FPR: 16.75 +/- 3.27 %
Average FNR: 13.98 +/- 1.81 %
----------------------------------------
Average AIC: 6299.932 +/- 40.04351
Average BIC: 6778.618 +/- 44.47466
----------------------------------------
gamfit_less_vars with CV
----------------------------------------
Average accuracy: 84.12 +/- 1.36 %
----------------------------------------
Average AUC: 92.11 +/- 0.93 %
----------------------------------------
Average FPR: 17.13 +/- 3.12 %
Average FNR: 14.72 +/- 1.62 %
----------------------------------------
Average AIC: 6505.484 +/- 39.65097
Average BIC: 6772.415 +/- 39.65286
----------------------------------------
Note
Look at r_scripts/testing_ROSE/ROSE_ensamble.R
EVALUATION OF THE MODELS WITH REDUCED ATTRIBUTES HAS NOT BEEN DONE YET
(1) AdaBoost with static train/test division
----------------------------------------
Predicted
Actual Existing Attrited
Existing 889 135
Attrited 143 857
----------------------------------------
Accuracy: 86.26 %
Dummy classifier accuracy: 50.59 %
----------------------------------------
AUC: 94.27 %
Dummy classifier AUC: 50 %
----------------------------------------
FPR: 13.18 %
FNR: 14.3 %
----------------------------------------
Variable importance:
Variable Mean_Gini_Decrease
1 Total_Trans_Ct 30.73697237
2 Total_Revolving_Bal 11.78081861
3 Total_Trans_Amt 9.92451533
4 Total_Relationship_Count 6.53934880
5 Months_Inactive_12_mon 5.67923580
6 Total_Amt_Chng_Q4_Q1 5.45173144
7 Total_Ct_Chng_Q4_Q1 5.36548361
8 Contacts_Count_12_mon 4.80681664
9 Months_on_book 4.27897800
10 Avg_Utilization_Ratio 3.10059310
11 Customer_Age 2.74046246
12 Credit_Limit 2.47675632
13 Avg_Open_To_Buy 1.96121788
14 Dependent_count 1.67558174
15 Education_Level 1.40795148
16 Gender 1.28538718
17 Card_Category 0.35993893
18 Marital_Status 0.35973450
19 Income_Category 0.06847582
----------------------------------------
(2) Random Forest with static train/test division
----------------------------------------
Predicted
Actual Existing Attrited
Existing 895 129
Attrited 122 878
----------------------------------------
Accuracy: 87.6 %
Dummy classifier accuracy: 50.59 %
----------------------------------------
AUC: 94.47 %
Dummy classifier AUC: 50 %
----------------------------------------
FPR: 12.6 %
FNR: 12.2 %
----------------------------------------
Variable importance:
Variable Mean_Gini_Decrease
17 Total_Trans_Ct 900.08843
13 Total_Revolving_Bal 495.51746
18 Total_Ct_Chng_Q4_Q1 424.42724
16 Total_Trans_Amt 311.82450
9 Total_Relationship_Count 274.31340
11 Contacts_Count_12_mon 210.51677
15 Total_Amt_Chng_Q4_Q1 191.69802
19 Avg_Utilization_Ratio 189.09073
10 Months_Inactive_12_mon 175.51852
8 Months_on_book 144.87180
1 Customer_Age 139.52477
14 Avg_Open_To_Buy 135.38430
12 Credit_Limit 128.35728
3 Dependent_count 124.38720
4 Education_Level 116.00039
2 Gender 42.53218
5 Marital_Status 20.06260
7 Card_Category 17.94270
6 Income_Category 8.24609
----------------------------------------
(3) AdaBoost with CV
----------------------------------------
Average accuracy: 87.25 +/- 0.86 %
----------------------------------------
Average AUC: 94.57 +/- 0.8 %
----------------------------------------
Average FPR: 13.22 +/- 1.3 %
Average FNR: 12.27 +/- 1.11 %
----------------------------------------
Average variable importance ranking:
Variable Mean_Gini_Decrease Std_Dev
1 Total_Trans_Ct 34.75402277 1.91066726
2 Total_Revolving_Bal 12.94150229 1.37714803
3 Total_Trans_Amt 8.86350309 0.58198055
4 Total_Relationship_Count 7.21951758 0.38540495
5 Total_Ct_Chng_Q4_Q1 6.38904043 0.32115247
6 Months_Inactive_12_mon 5.56077118 0.24868059
7 Total_Amt_Chng_Q4_Q1 5.03931768 0.42429973
8 Contacts_Count_12_mon 4.52194348 0.23543461
9 Months_on_book 3.31371989 0.55059750
10 Avg_Utilization_Ratio 2.42278591 0.23323968
11 Customer_Age 1.94948740 0.29291633
12 Dependent_count 1.57882847 0.24071690
13 Avg_Open_To_Buy 1.39104697 0.24696721
14 Education_Level 1.26301133 0.21294553
15 Gender 1.18618812 0.21157629
16 Credit_Limit 1.00167942 0.20688228
17 Marital_Status 0.36009097 0.07319159
18 Card_Category 0.20161414 0.10068651
19 Income_Category 0.04192886 0.04021035
----------------------------------------
(4) Random Forest with CV
----------------------------------------
Average accuracy: 87.4 +/- 0.94 %
----------------------------------------
Average AUC: 94.59 +/- 0.55 %
----------------------------------------
Average FPR: 13.27 +/- 1.64 %
Average FNR: 11.98 +/- 1.38 %
----------------------------------------
Average variable importance ranking:
Variable Mean_Gini_Decrease Std_Dev
1 Total_Trans_Ct 1007.348407 1.1698639
2 Total_Revolving_Bal 563.033669 2.4616142
3 Total_Ct_Chng_Q4_Q1 469.475970 2.3207382
4 Total_Trans_Amt 364.184981 1.7972846
5 Total_Relationship_Count 300.915929 0.9388124
6 Contacts_Count_12_mon 230.128678 0.3218984
7 Total_Amt_Chng_Q4_Q1 218.784653 1.1144787
8 Avg_Utilization_Ratio 215.417326 1.3194480
9 Months_Inactive_12_mon 198.560604 4.6139068
10 Months_on_book 160.174048 5.4049504
11 Customer_Age 156.434189 4.8913729
12 Avg_Open_To_Buy 149.690026 1.3836461
13 Credit_Limit 145.677997 9.7837391
14 Dependent_count 140.642761 1.0893933
15 Education_Level 128.893361 3.8013993
16 Gender 52.817680 9.8382519
17 Marital_Status 23.393058 16.5777671
18 Card_Category 21.044929 8.8143522
19 Income_Category 9.318932 3.9632489
----------------------------------------
Note
Look at r_scripts/testing_ROSE/ROSE_Decision_Trees.R
25 ROSE_tree Accuracy 81.79125
26 ROSE_tree Precision 81.96530
27 ROSE_tree Recall 82.06129
28 ROSE_tree Specificity 81.51479
29 ROSE_tree F1_Score 82.01327
30 ROSE_tree AUC_ROC 81.78804