---
title: "GroupB_Final"
author: "S.Carpenè, G.Fantuzzi, V.Nigam, M.Tallone, A.Valentinis"
date: "2024-02-14"
output:
  html_document:
    toc: yes
    toc_depth: '3'
    df_print: paged
  pdf_document:
    toc: yes
    toc_depth: 3
editor_options:
  markdown:
    wrap: 72
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r libraries_and_utilities, echo=FALSE, warning=FALSE, message=FALSE}
# Loading the utilities for libraries, assessment and plots
source("markdown_utils.r")
# Any additional options
options(rgl.useNULL=TRUE)
# Loading the dataset for eda plots
# Set working directory as this directory
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
# Load the dataset from the datasets/ folder
bank <- read.csv("./datasets/BankChurners.csv", sep = ",")
# ⚠️ Remove the first and last two columns as suggested in the README
bank <- bank[, -c(1, 22, 23)]
# ⚠️ Convert the Attrition_Flag column to a binary variable:
# - 0: Existing Customer
# - 1: Attrited Customer
bank$Attrition_Flag <- ifelse(bank$Attrition_Flag == "Attrited Customer", 1, 0)
# Convert all categorical variables to factors
bank$Gender <- as.factor(bank$Gender)
bank$Education_Level <- as.factor(bank$Education_Level)
bank$Marital_Status <- as.factor(bank$Marital_Status)
bank$Income_Category <- as.factor(bank$Income_Category)
bank$Card_Category <- as.factor(bank$Card_Category)
# Filter numerical variables and categorical variables
bank_num <- bank[, sapply(bank, is.numeric)]
bank_cat <- bank[, sapply(bank, is.factor)]
# Bank Logistic dataset
# Load the dataset and pre-process it
bank_logistic <- read.csv("./datasets/BankChurners.csv", sep = ",")
bank_logistic <- bank_logistic[, -c(1, 3, 5, 6, 9, 10, 14, 16, 17, 21, 22, 23)]
bank_logistic$Attrition_Flag <- ifelse(bank_logistic$Attrition_Flag == "Attrited Customer", 1, 0)
# Convert all categorical variables to factors and reorder the levels
bank_logistic$Gender <- as.factor(bank_logistic$Gender)
bank_logistic$Marital_Status <- as.factor(bank_logistic$Marital_Status)
bank_logistic$Marital_Status <- forcats::fct_relevel(bank_logistic$Marital_Status,
"Unknown",
"Single",
"Married",
"Divorced")
# Changing the levels of Marital_Status in either married or not married
bank_logistic$Marital_Status <- fct_collapse(bank_logistic$Marital_Status,
"Married" = c("Married"),
"Not Married" = c("Divorced",
"Single",
"Unknown"))
bank_logistic$Income_Category <- as.factor(bank_logistic$Income_Category)
bank_logistic$Income_Category <- forcats::fct_relevel(bank_logistic$Income_Category,
"Unknown",
"Less than $40K",
"$40K - $60K",
"$60K - $80K",
"$80K - $120K",
"$120K +")
# Changing the levels of income category into a binary variable:
bank_logistic$Income_Category <- fct_collapse(bank_logistic$Income_Category,
"Less than 120K" = c("Unknown",
"Less than $40K",
"$40K - $60K",
"$60K - $80K",
"$80K - $120K"),
"More than 120K" = c("$120K +"))
# Override the Total_Trans_Amt variable with its log !!!
bank_logistic$Total_Trans_Amt <- log(bank_logistic$Total_Trans_Amt)
# Standardization (optional) all columns except response and categorical
bank_logistic[, -c(1, 2, 3, 4)] <- scale(bank_logistic[, -c(1, 2, 3, 4)])
# Rebuilding and checking the model
simple_logistic_model <- learn_logistic(bank_logistic)
# ROSE
bank_logistic_balanced<- ROSE(Attrition_Flag~.,data=bank_logistic,seed = 123)$data
```
# Problem statement and dataset
The `BankChurners` dataset contains information about a bank's customers and their credit card usage. The dataset is available on the Kaggle website and is used for a classification problem. The main goal is to predict the `Attrition_Flag` response variable, which indicates whether a customer will churn or not. The dataset consists of 10127 observations and 21 variables.
# Data preparation and EDA (Exploratory Data Analysis)
With the exclusion of the first variable (`CLIENTNUM`), that is just an identification number, and the last two (`Naive_Bayes_Classifier...`) variables, the dataset contains 19 possible covariates to be used for prediction. Among these, there are 6 categorical variables and 13 numerical variables.
The `Attrition_Flag` variable is the response variable and it is binary. As it's possible to see from the following barplot, the dataset is unbalanced, with only 16.07% of the observations being positive for the `Attrition_Flag` variable, i.e. a churned customer. \
For this reason, in some of the models presented in this report, we attempted to balance the dataset using the *ROSE* package. Below, the proportions for the original dataset and the balanced one are reported.
```{r attrition_flag_barplot, echo=FALSE, fig.width=8, fig.height=5}
p1 <- ggplot(bank, aes(x = as.factor(Attrition_Flag))) +
geom_bar(aes(fill = as.factor(Attrition_Flag)), color = "#FFFFFF") +
scale_fill_manual(values = c("royalblue", "#FF5733"),
name = "Attrition Flag:",
labels = c("Existing", "Attrited")) +
geom_text(aes(label = after_stat(count),
y = after_stat(count)),
stat = "count",
vjust = -0.5,
size = 10) +
labs(x = "Attrition Flag", y = "Count") +
ggtitle("Response variable: Attrition_Flag") +
theme(legend.position = c(.85, .85),
legend.background = element_rect(fill = "transparent"),
legend.direction = "vertical",
legend.title = element_text(size = 10),
aspect.ratio = 1) +
ylim(0, 10000)
p2 <- ggplot(bank_logistic_balanced, aes(x = as.factor(Attrition_Flag))) +
geom_bar(aes(fill = as.factor(Attrition_Flag)), color = "#FFFFFF") +
scale_fill_manual(values = c("royalblue", "#FF5733"),
name = "Attrition Flag:",
labels = c("Existing", "Attrited")) +
geom_text(aes(label = after_stat(count),
y = after_stat(count)),
stat = "count",
vjust = -0.5,
size = 10) +
labs(x = "Attrition Flag", y = "Count") +
ggtitle("Response variable: Attrition_Flag") +
theme(legend.position = c(.85, .85),
legend.background = element_rect(fill = "transparent"),
legend.direction = "vertical",
legend.title = element_text(size = 10),
aspect.ratio = 1) +
ylim(0, 10000)
p1 | p2
```
```{r,echo=FALSE}
cat("Proportion of attrited (original dataset):",
sum(bank$Attrition_Flag==1)/sum(table(bank_logistic$Attrition_Flag))*100,"%")
cat("Proportion of attrited (ROSE):",
sum(bank_logistic_balanced$Attrition_Flag==1)/sum(table(bank_logistic$Attrition_Flag))*100,"%")
```
As for the available covariates, we first checked for possible correlation between the variables by computing the correlation matrix, which we plotted using a heatmap. The correlation matrix is the following:
```{r correlation_matrix, message=FALSE, echo=FALSE, fig.width=10, fig.height=10}
# Compute the correlation matrix
corm <- bank_num |>
corrr::correlate() |>
corrr::shave(upper = FALSE)
# Pivot the matrix and fix the labels
corm <- corm |>
tidyr::pivot_longer(
cols = -term,
names_to = "colname",
values_to = "corr"
) |>
dplyr::mutate(
rowname = forcats::fct_inorder(term),
colname = forcats::fct_inorder(colname),
label = dplyr::if_else(is.na(corr), "", sprintf("%1.2f", corr))
)
# Plot the correlation matrix
ggplot(corm, aes(rowname, fct_rev(colname),
fill = corr)) +
geom_tile() +
geom_text(aes(
label = label,
color = abs(corr) < .75
)) +
coord_fixed(expand = FALSE) +
scale_color_manual(
values = c("white", "black"),
guide = "none"
) +
scale_fill_distiller(
palette = "RdYlBu", na.value = "white",
direction = -1, limits = c(-1, 1),
name = "Pearson\nCorrelation:"
) +
labs(x = NULL, y = NULL) +
theme(panel.border = element_rect(color = NA, fill = NA),
legend.position = c(.85, .8),
axis.text.x = element_text(angle = 50, vjust = 1, hjust = 1))
```
As it's possible to see from the first column, the majority of the variables are not highly correlated with the response variable, with the exception of the `Total_Trans_Ct` variable, which has a correlation of 0.37. However, an important thing to notice is that the correlation matrix shows some variables that are highly correlated with each other. Among these we can highlight:
* the `Avg_Open_To_Buy` and `Credit_Limit` variables, which are totally correlated: this seemed particularly strange to us, and it is also highlighted by the plot of the two variables below;
* the `Months_on_book` and `Customer_Age` variables, which are reasonably correlated with a correlation of 0.79: indeed, one expects the number of months a customer has been with the bank to be related to the customer's age;
* the `Total_Trans_Amt` and `Total_Trans_Ct` variables, which show a correlation of 0.81;
* the `Total_Revolving_Bal` and `Avg_Utilization_Ratio` variables, which are also highly correlated, with a correlation of 0.62.
As we can see below, the plot of the two totally correlated variables `Avg_Open_To_Buy` and `Credit_Limit` shows a seemingly perfect linear relationship, which we found quite anomalous.\
```{r avg_open_and_credit_line, echo=FALSE, fig.width=8, fig.height=8}
# Plot of Avg_Open_To_Buy vs Credit_Limit
ggplot(bank, aes(x = Credit_Limit, y = Avg_Open_To_Buy, color = as.factor(Attrition_Flag))) +
geom_point(alpha = .5) +
scale_color_manual(values = c("0" = "royalblue", "1" = "#FF5733"),
name = "Attrition Flag:",
labels = c("Existing", "Attrited")) +
labs(x = "Credit Limit", y = "Average Open to Buy") +
theme(legend.position = c(.85, .15),
legend.background = element_rect(fill = "transparent"),
legend.title = element_text(size = 10),
aspect.ratio = 1)
```
Since the `Total_Trans_Amt` and `Total_Trans_Ct` variables are highly correlated with each other but, contrary to the other variables, also show a significant correlation with the response variable, it made sense to plot them in the following scatter plot to check for possible patterns.
```{r transactions_scatterplot, echo=FALSE, fig.width=8, fig.height=8}
# Plot of Total_Trans_Ct vs Total_Trans_Amt
ggplot(bank, aes(x = log(Total_Trans_Amt), y = Total_Trans_Ct, color = as.factor(Attrition_Flag))) +
geom_point(alpha = .5) +
scale_color_manual(values = c("0" = "royalblue", "1" = "#FF5733"),
name = "Attrition Flag:",
labels = c("Existing", "Attrited")) +
  labs(x = "log(Total Transaction Amount)", y = "Total Transaction Count") +
theme(legend.position = c(.85, .15),
legend.background = element_rect(fill = "transparent"),
legend.title = element_text(size = 10),
aspect.ratio = 1)
```
The scatter plot represents the total number of transactions (on the y-axis) against the logarithm of the total transaction amount (on the x-axis). The points are colored by the `Attrition_Flag` variable, and we can indeed visually confirm the correlation between the two variables. However, we can also spot that the churned customers are more concentrated in the lower-left part of the plot.\
This fact is also reflected in the distribution analysis of the two variables reported below.
```{r trans_amt_distribution, echo=FALSE, fig.width=8, fig.height=8}
plot_continuous(bank, Total_Trans_Amt, "Total Transaction Amount", "Amount", 1000)
```
```{r trans_ct_distribution, fig.width=8, fig.height=8}
# plot_discrete(bank, Total_Trans_Ct, "Total Transaction Count", "Count")
plot_continuous(bank, Total_Trans_Ct, "Total Transaction Count", "Count", 5)
```
# Models implementation
Different models have been built to predict the `Attrition_Flag` variable, and they are presented in this section.
In order to assess model performance, different effectiveness metrics have been used. These have been computed both by fitting the model on all the observations in the dataset and by performing a k-fold cross-validation with $k=10$. The metrics used are the following:
* Accuracy
* AUC
* FPR (False Positive Rate)
* FNR (False Negative Rate)
* Confusion matrix (*only for a single static split*)
The Dummy classifier has been taken as a baseline for comparison.
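As a concrete reference, all of the threshold-based metrics above can be derived from a 2x2 confusion matrix. A minimal base-R sketch, using small hypothetical prediction vectors (not the report's actual model output), is:

```r
# Hypothetical hard predictions and true labels (1 = attrited)
truth <- c(0, 0, 0, 0, 1, 1, 0, 1)
pred  <- c(0, 0, 1, 0, 1, 0, 0, 1)

cm <- table(Predicted = pred, Actual = truth)
TN <- cm["0", "0"]; FP <- cm["1", "0"]
FN <- cm["0", "1"]; TP <- cm["1", "1"]

accuracy <- (TP + TN) / sum(cm)
fpr <- FP / (FP + TN)  # false positive rate
fnr <- FN / (FN + TP)  # false negative rate
```

The AUC, by contrast, is computed from predicted probabilities rather than hard labels, which is why it is reported separately by the assessment utilities.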
## Logistic regression
The first model we built in the attempt to predict the response variable has been a logistic regression model using the $logit$ link function.\
In the initial model we included all the covariates except those with low correlation with the response variable. We further filtered the covariate selection by discarding the variables with a $p$-value higher than 0.05 in the `glm()` summary output, i.e. the variables that were not statistically significant for the prediction.\
The final model has been built using the following variables:
* `Gender`: the gender of the customer
* `Marital_Status`: the marital status of the customer
* `Income_Category`: the income category of the customer
* `Total_Relationship_Count`: the total number of products held by the customer
* `Months_Inactive_12_mon`: the number of months inactive in the last 12 months
* `Contacts_Count_12_mon`: the number of contacts in the last 12 months
* `Total_Revolving_Bal`: the total revolving balance on the credit card
<!-- * `Total_Trans_Amt`: the total transaction amount in the last 12 months -->
* `Total_Trans_Ct`: the total transaction count in the last 12 months
* `Total_Ct_Chng_Q4_Q1`: the change in transaction count from Q4 to Q1
We decided to remove the `Total_Trans_Amt` variable to avoid possible multicollinearity problems.\
Additionally, for this particular model, all of the numerical variables have been standardized to have a mean of 0 and a standard deviation of 1. Some covariates have also been slightly modified to improve the model performance. Among these we mention that:
<!-- * looking at its data distribution it has been taken the logarihmic values of the `Total_Trans_Amt` variable; -->
* the `Marital_Status` variable has been converted into a binary variable, taking the levels "Married" and "Not Married", since this change helped improve the statistical significance of the variable in the final model;
* the `Income_Category` variable has also been converted into a binary variable, dividing the clients into two categories: those with an income of less than 120K and those with an income of more than 120K.\
All of these changes have significantly improved the model performance as portrayed by the metrics explained below. An ANOVA test has also been performed to check the significance of the model.
The results of the ANOVA test are the following:
```{r logistic_model_anova}
anova(simple_logistic_model, test = "Chisq")
```
As is possible to see, all the covariates contribute to decreasing the deviance of the model, and they are all statistically significant. Looking at these results, the contribution to the deviance decrease of the two categorical variables `Marital_Status` and `Income_Category` is clearly lower than that of the other variables, so an attempt to simplify the model by removing these two variables could be made. However, both for comparison purposes and for the sake of model performance, we decided to keep these variables in the model.\
On the other hand, possible multicollinearity problems, especially among the variables that showed high correlation in the exploratory analysis, were checked by looking at the Variance Inflation Factor (VIF) of the covariates.\
```{r logistic_model_build, echo=FALSE}
vif(simple_logistic_model)
```
<!-- The covariates we were more worried of were the `Total_Trans_Ct` and `Total_Trans_Amt` since they showed a high correlation in the exploratory analysis. However, the VIF values are all below 10, which is a common threshold to consider multicollinearity as a problem. For this reason we decided to keep these variables since they indeed helped improving the model.\ -->
### Results on the original dataset
On the given dataset, the built logistic model summary is the following:
```{r logistic_model_summary}
summary(simple_logistic_model)
```
<!-- The main thing we can notice from this summary is that, as expected, the two most impactful covariates are the `Total_Trans_Ct` and the `Total_Trans_Amt`. Surprisingly, their effect seems to be opposite, in particular: -->
The main thing we can notice from this summary is that, as expected, the most impactful covariate is `Total_Trans_Ct`. It has a negative coefficient, which means that the higher the total transaction count, the lower the probability that a customer churns, just as we noticed from the original plot.
From this we might deduce that the bank is able to satisfy clients that perform a high volume of transactions.\
Additionally, the intercept of the model also has a significant impact, with a negative coefficient. The intercept can be interpreted only if it makes sense for all of the covariates to be equal to 0; in this case, that roughly corresponds to a newly acquired customer, and indeed it seems that the probability that a new customer churns is lower than that of an existing customer.\
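Because the model uses the logit link, a coefficient $\beta$ translates to a multiplicative effect $e^{\beta}$ on the odds of churning. A minimal sketch of this interpretation, using illustrative coefficient values (not the fitted ones from `simple_logistic_model`), is:

```r
# Illustrative coefficients: an intercept and a single standardized
# covariate, e.g. something like Total_Trans_Ct (values are made up)
intercept <- -2.0
beta_ct   <- -0.9

# Log-odds -> churn probability for a customer one sd above the mean
logit   <- intercept + beta_ct * 1
p_churn <- 1 / (1 + exp(-logit))

# Odds ratio: multiplicative effect of +1 sd on the odds of churning;
# values below 1 mean the covariate reduces the churn probability
odds_ratio <- exp(beta_ct)
```

In practice the same transformation can be applied to the whole fitted model with `exp(coef(model))`.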
For what concerns the effectiveness metrics, we measured the following results on the original dataset:
```{r logistic model_results, message=FALSE, warning=FALSE}
logistic_results <- assess_logistic(simple_logistic_model, bank_logistic)
```
And similar results for the k-fold cross validation:
```{r logistic_model_cv_results, message=FALSE, warning=FALSE}
cv_logistic(bank_logistic)
```
From the results we can clearly see that the model is performing well overall. The accuracy is consistently over $90\%$ on all the folds, but this is not too surprising given the unbalanced nature of the dataset. Good values have nonetheless been reached for the AUC and the FPR, with the worst metric being the FNR, which falls just below $70\%$. This is of course a limit of this simple model, and it's also related to the unbalanced nature of the dataset. Indeed, as discussed above, the main advantage of this model is the possibility to interpret the coefficients of the covariates and to understand the impact of each of them on the response variable.
### Results on the synthetic dataset (ROSE)
As said in the beginning, we also tried to balance the dataset using the *ROSE* package. Let's apply it:
```{r}
bank_logistic_balanced<- ROSE(Attrition_Flag~.,data=bank_logistic,seed = 123)$data
```
Now we can learn a logistic regression model on the new balanced dataset:
```{r}
ROSE_logistic_model <- learn_logistic(bank_logistic_balanced)
```
For what concerns the effectiveness metrics on the static train/test split:
```{r, message=FALSE, warning=FALSE}
ROSE_logistic_results<- assess_logistic(ROSE_logistic_model, bank_logistic_balanced)
```
Instead, by computing a 10-fold CV:
```{r, message=FALSE, warning=FALSE}
cv_logistic(bank_logistic_balanced)
```
**Comment:** the accuracy of this model is, as expected, lower than that of the logistic model learnt on the unbalanced dataset. However, the difference with respect to the dummy classifier becomes definitely more evident. Regarding the AUC score, even if its value is slightly lower than before, it still showcases how good the fit of our model is. The metric that really improves after applying ROSE is the FNR: the model learnt on the unbalanced dataset had an average FNR of 44.11%, while now it drops to around 17.87%. Conversely, the FPR increases from 3.02% to 17.85%, but since we don't know anything about the costs of the errors, we considered it better to have more balanced FPR and FNR metrics.
## GAM/Splines
We also tried using a Generalized Additive Model (GAM) to predict the `Attrition_Flag` variable. At first we included all the variables in the model, using a spline approximation for all the non-categorical variables, obtaining an accuracy of 96.42% (AUC 98.99%).
Then we removed from the model the variables that were not statistically significant, and we included as linear terms the variables whose spline approximation had an estimated degrees of freedom (edf) less than or equal to 2, obtaining an accuracy of 96.41% (AUC 98.93%).
The final model has been built using the following variables as linear terms:
* `Customer_Age`: customer age in years
* `Gender`: the gender of the customer
* `Dependent_count`: number of dependents
* `Marital_Status`: married, not married
* `Total_Relationship_Count`: the total number of products held by the customer
* `Months_Inactive_12_mon`: the number of months inactive in the last 12 months
* `Contacts_Count_12_mon`: the number of contacts in the last 12 months
* `Total_Revolving_Bal`: the total revolving balance on the credit card
* `Total_Amt_Chng_Q4_Q1`: change in transaction amount
* `Total_Trans_Amt`: the total transaction amount in the last 12 months
* `Total_Trans_Ct`: the total transaction count in the last 12 months
* `Total_Ct_Chng_Q4_Q1`: the change in transaction count from Q4 to Q1
As in the logistic regression model we converted `Months_Inactive_12_mon` to a categorical variable with 4 categories: 1,2,3,4+; and we took the logarithmic values of `Total_Trans_Amt`.
The assessment has been done using the same procedure as the logistic regression model: we performed a k-fold cross validation with k=10, and we used the same metrics.
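The linear-vs-spline decision described above can be sketched with the `mgcv` package (which ships with R). The snippet below uses simulated stand-in data and a hypothetical two-covariate formula, not the actual formula inside `learn_gam`:

```r
library(mgcv)

# Simulated stand-in data: x1 has a truly linear effect, x2 a curved one
set.seed(1)
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(-1 + 0.8 * df$x1 - 0.5 * df$x2^2))

# Fit a binomial GAM with a spline on each covariate
fit <- gam(y ~ s(x1) + s(x2), family = binomial, data = df)

# Smooths whose estimated df is close to 1 are effectively linear and
# can be moved out of s() into a plain linear term, as done in the report
summary(fit)$edf
```

Inspecting the per-smooth `edf` values this way is what motivates the "edf $\leq$ 2 becomes linear" rule applied above.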
```{r gam_model_build, echo=FALSE}
# Set working directory as this directory
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
# Load the dataset and pre-process it
bank_gam <- read.csv("./datasets/BankChurners.csv", sep = ",")
# ⚠️ Remove the last two columns as suggested in the README
bank_gam <- bank_gam[, -c(22, 23)]
# ⚠️ Remove the first column as it is just an index
bank_gam <- bank_gam[, -1]
# ⚠️ Convert the Attrition_Flag column to a binary variable:
# - 0: Existing Customer
# - 1: Attrited Customer
bank_gam$Attrition_Flag <- ifelse(bank_gam$Attrition_Flag == "Attrited Customer", 1, 0)
# Convert all categorical variables to factors
bank_gam$Gender <- as.factor(bank_gam$Gender)
bank_gam$Income_Category <- fct_collapse(bank_gam$Income_Category,
"Less than 120K" = c("Unknown",
"Less than $40K",
"$40K - $60K",
"$60K - $80K",
"$80K - $120K"),
"More than 120K" = c("$120K +"))
# Changing the levels of Marital_Status to either married or not married
bank_gam$Marital_Status <- fct_collapse(bank_gam$Marital_Status,
"Married" = c("Married"),
"Not Married" = c("Divorced",
"Single",
"Unknown"))
# Converting Months_Inactive_12_mon to a factor
bank_gam$Months_Inactive_12_mon <- as.factor(bank_gam$Months_Inactive_12_mon)
levels(bank_gam$Months_Inactive_12_mon) <- c("0", "1", "2", "3", "4", "5", "6+")
# Joining together the levels after 4 months
bank_gam$Months_Inactive_12_mon <- fct_collapse(bank_gam$Months_Inactive_12_mon,
"4+" = c("4", "5", "6+"))
bank_gam$Education_Level <- as.factor(bank_gam$Education_Level)
bank_gam$Marital_Status <- as.factor(bank_gam$Marital_Status)
bank_gam$Income_Category <- as.factor(bank_gam$Income_Category)
bank_gam$Card_Category <- as.factor(bank_gam$Card_Category)
bank_gam$Total_Trans_Amt <- log(bank_gam$Total_Trans_Amt)
```
### Results on the original dataset
The built GAM model (`gamfit`) is:
```{r gam_model_summary}
gamfit<-learn_gam(bank_gam)
summary(gamfit)
```
Regarding the effectiveness metrics we obtained:
```{r gam model_results, message=FALSE, warning=FALSE}
gam_results <- assess_gam(gamfit, bank_gam)
```
And similar results for the k-fold cross validation:
```{r gam_model_cv_results, message=FALSE, warning=FALSE}
cv_gam(bank_gam)
```
### Results on the synthetic dataset (ROSE)
First of all, let's obtain the synthetic dataset:
```{r,message=FALSE, warning=FALSE}
bank_gam_balanced<- ROSE(Attrition_Flag~.,data=bank_gam,seed = 123)$data
```
Learning phase of the model:
```{r,message=FALSE, warning=FALSE}
ROSE_gamfit<-learn_gam(bank_gam_balanced)
```
```{r,message=FALSE, warning=FALSE}
ROSE_gam_results<- assess_gam(ROSE_gamfit, bank_gam_balanced)
```
```{r, message=FALSE, warning=FALSE}
cv_gam(bank_gam_balanced)
```
## Decision Trees
Here,we have tried to use the Decision Trees method via classification trees to consider modelling the predictor variable Attrition Flag.\
We started with preprocessing the dataset on similar lines as done before for previous models and converted target variable Attrition flag into \
binary variable (0 and 1) and converted relevant categorical variables into factors.We took the approach of building classification tree model by \
starting with dummy classifier (as a baseline model) followed by full tree and later optimising tree with less variables(reduced tree model),\
k-fold tree and finally with hyperparameter tuned model.
The final model (tuned one) has been built using the following variables:
* `Gender`: the gender of the customer
* `Dependent_count`: total dependents on the customer
* `Marital_Status`: the marital status of the customer
* `Income_Category`: the income category of the customer
* `Card_Category`: the Type of Card (Blue, Silver, Gold, Platinum)
* `Total_Relationship_Count`: the total number of products held by the customer
* `Months_Inactive_12_mon`: the number of months inactive in the last 12 months
* `Contacts_Count_12_mon`: the number of contacts in the last 12 months
* `Credit_Limit`: the credit Limit on the Credit Card
* `Total_Revolving_Bal`: the total revolving balance on the credit card
* `Total_Trans_Amt`: the total transaction amount in the last 12 months
* `Total_Trans_Ct`: the total transaction count in the last 12 months
* `Total_Ct_Chng_Q4_Q1`: the change in transaction count from Q4 to Q1
* `Total_Amt_Chng_Q4_Q1`: the change in transaction amount (Q4 over Q1)
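The hyperparameter tuning mentioned above revolves mainly around the tree's complexity parameter `cp`. A minimal sketch of cost-complexity pruning with `rpart`, on simulated stand-in data rather than the bank dataset, is:

```r
library(rpart)

# Simulated stand-in data with a simple signal in x1
set.seed(123)
n  <- 400
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + rnorm(n, sd = 0.5) > 0, 1, 0))

# Grow a deliberately large tree, then prune it back
fit <- rpart(y ~ ., data = df, method = "class",
             control = rpart.control(cp = 0.001, minsplit = 5))

# Pick the cp with the lowest cross-validated error (xerror column)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

Pruning at the `xerror`-minimizing `cp` trades a small loss in training fit for better generalization, which is the goal of the tuned model below.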
### Pre-processing steps
```{r}
#install.packages("caret")
# LOADING AND PREPROCESSING ----------------------------------------------------
# Set working directory as this directory
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
# Load the dataset from the datasets/ folder
bank <- read.csv("./datasets/BankChurners.csv", sep = ",")
#Remove the last two columns as suggested in the README. Remove the first ID column.
bank <- bank[, -c(1, 22, 23)]
# Convert the Attrition_Flag column to a binary variable:
# - 0: Existing Customer
# - 1: Attrited Customer
bank$Attrition_Flag <- ifelse(bank$Attrition_Flag == "Attrited Customer", 1, 0)
# Convert all categorical variables to factors
bank$Attrition_Flag <- as.factor(bank$Attrition_Flag)
bank$Gender <- as.factor(bank$Gender)
bank$Education_Level <- as.factor(bank$Education_Level)
bank$Marital_Status <- as.factor(bank$Marital_Status)
bank$Income_Category <- as.factor(bank$Income_Category)
bank$Card_Category <- as.factor(bank$Card_Category)
#Classification tree
#Libraries ---------------------------------------------------------------
library(MASS)
library(rpart)
library(rpart.plot)
library(caret)
library(pROC)
library(car) # for checking vif for multicollinearity.
library(ggplot2)
# Perform partition on entire data
set.seed(123)
trainIndex <- createDataPartition(bank$Attrition_Flag, p = 0.8, list = FALSE)
train_data <- bank[trainIndex, ]
validation_data <- bank[-trainIndex, ]
# Classification tree for dummy classifier
# Dummy classifier: Predict the majority class for all instances
majority_class_train <- levels(train_data$Attrition_Flag)[which.max(table(train_data$Attrition_Flag))]
train_data$dummy_predictions <- rep(majority_class_train, nrow(train_data))
majority_class_validation <- levels(validation_data$Attrition_Flag)[which.max(table(validation_data$Attrition_Flag))]
validation_data$dummy_predictions <- rep(majority_class_validation, nrow(validation_data))
# Plot the dummy tree (for visualization purposes)
dummy_tree <- rpart(Attrition_Flag ~ dummy_predictions, data = train_data, method = "class")
rpart.plot(dummy_tree, extra=1, digits=4, box.palette="auto")
```
The dummy tree here is shown as a single node which predicts the majority class for all instances.
Dummy tree metrics:
```{r}
predictions_dummy <- predict(dummy_tree, type = "class", newdata = validation_data)
#Performance indices:
# Calculate accuracy
accuracy_dummy <- confusionMatrix(predictions_dummy, validation_data$Attrition_Flag)$overall['Accuracy']
# Calculate precision
precision_dummy <- confusionMatrix(predictions_dummy, validation_data$Attrition_Flag)$byClass['Precision']
# Calculate recall
recall_dummy <- confusionMatrix(predictions_dummy, validation_data$Attrition_Flag)$byClass['Sensitivity']
# Calculate specificity
specificity_dummy <- confusionMatrix(predictions_dummy, validation_data$Attrition_Flag)$byClass['Specificity']
#Calculate FPR
fpr_dummy <- round(1-specificity_dummy,2)*100
#Calculate FNR
fnr_dummy <- round(1-recall_dummy,2)*100
# Calculate F1 score
f1_score_dummy <- confusionMatrix(predictions_dummy, validation_data$Attrition_Flag)$byClass['F1']
# Calculate AUC-ROC
roc_dummy <- roc(validation_data$Attrition_Flag, as.numeric(predictions_dummy))
auc_roc_dummy <- auc(roc_dummy)
original_performance_dummy <- c(Accuracy = round(accuracy_dummy*100,digits=2), Precision = round(precision_dummy*100,digits=2), Recall = round(recall_dummy*100,digits=2),
FPR=fpr_dummy, FNR=fnr_dummy,Specificity = round(specificity_dummy*100,digits=2), F1_Score = round(f1_score_dummy*100,digits=2), AUC_ROC = round(auc_roc_dummy*100,digits=2))
original_performance_dummy
train_data<- subset(train_data, select = -dummy_predictions)
validation_data<- subset(validation_data, select = -dummy_predictions)
```
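The same block of performance indices is recomputed for every tree below. As a sketch (not part of the original analysis; the helper name `perf_indices` is ours), they could all be derived from a single `confusionMatrix()` call:

```{r}
# Hypothetical helper: compute all the performance indices used in this
# report from one confusionMatrix() call instead of one call per metric.
# Assumes caret and pROC are loaded, as in the chunks above.
perf_indices <- function(predictions, actual) {
  cm <- caret::confusionMatrix(predictions, actual)
  roc_obj <- pROC::roc(actual, as.numeric(predictions))
  round(100 * c(Accuracy    = unname(cm$overall["Accuracy"]),
                Precision   = unname(cm$byClass["Precision"]),
                Recall      = unname(cm$byClass["Sensitivity"]),
                Specificity = unname(cm$byClass["Specificity"]),
                FPR         = unname(1 - cm$byClass["Specificity"]),
                FNR         = unname(1 - cm$byClass["Sensitivity"]),
                F1_Score    = unname(cm$byClass["F1"]),
                AUC_ROC     = as.numeric(pROC::auc(roc_obj))), 2)
}
```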
Taking the dummy tree above as a baseline, we move on to finding an optimal model.
# Classification tree considering all predictors and response = Attrition_Flag
# Full tree with plot
```{r}
fit <- rpart(Attrition_Flag ~ ., method="class", data = train_data)
rpart.plot(fit, extra=1, digits=4, box.palette="auto")
predictions <- predict(fit, type = "class", newdata = validation_data)
#table(predictions)
```
# Performance indices of the full tree:
```{r}
# Calculate accuracy
accuracy <- confusionMatrix(predictions, validation_data$Attrition_Flag)$overall['Accuracy']
# Calculate precision
precision <- confusionMatrix(predictions, validation_data$Attrition_Flag)$byClass['Precision']
# Calculate recall
recall <- confusionMatrix(predictions, validation_data$Attrition_Flag)$byClass['Sensitivity']
# Calculate specificity
specificity <- confusionMatrix(predictions, validation_data$Attrition_Flag)$byClass['Specificity']
#Calculate FPR
fpr <- round(1-specificity,2)*100
#Calculate FNR
fnr <- round(1-recall,2)*100
# Calculate F1 score
f1_score <- confusionMatrix(predictions, validation_data$Attrition_Flag)$byClass['F1']
# Calculate AUC-ROC
roc <- roc(validation_data$Attrition_Flag, as.numeric(predictions))
auc_roc <- auc(roc)
original_performance <- c(Accuracy = round(accuracy*100,digits=2), Precision = round(precision*100,digits=2), Recall = round(recall*100,digits=2),
Specificity = round(specificity*100,digits=2),FPR=fpr,FNR=fnr,F1_Score = round(f1_score*100,digits=2), AUC_ROC = round(auc_roc*100,digits=2))
original_performance
```
# Now, let's use logistic regression to identify the significant variables, based on the p-values from glm
# We fit an LR model to select statistically significant variables, then use them in a reduced classification tree model
```{r}
lr1 <- glm(Attrition_Flag~., data = bank, family = binomial)
summary(lr1)
```
# The summary above also shows NA values. This happens when aliased variables are present. Let's check for aliases.
```{r}
alias(lr1)
```
# The variable "Avg_Open_To_Buy" is aliased. We remove it and rerun the LR.
```{r}
bank_no_alias <- bank[, -15] # creating data removing alias for running logistic regression.
lr2 <- glm(Attrition_Flag~., data = bank_no_alias, family = binomial)
summary(lr2)
```
# From the summary of lr2 model, the significant variables are:
#Gender, Dependent_count, Marital_Status, Income_Category, Card_Category, Total_Relationship_Count,
#Months_Inactive_12_mon, Contacts_Count_12_mon, Credit_Limit,Total_Revolving_Bal,
#Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1
## Further, check multicollinearity via VIF for the new model
```{r}
vif_values <- car::vif(lr2)
print(vif_values)
```
# There doesn't seem to be a multicollinearity issue, so we can proceed to building the reduced tree model
# Classification tree - reduced model, with predictors selected from the lr2 model, and its plot
```{r}
train_data <- train_data[,c("Attrition_Flag", "Gender", "Dependent_count", "Marital_Status", "Income_Category", "Card_Category",
"Total_Relationship_Count", "Months_Inactive_12_mon", "Contacts_Count_12_mon", "Credit_Limit",
"Total_Revolving_Bal", "Total_Amt_Chng_Q4_Q1", "Total_Trans_Amt","Total_Trans_Ct","Total_Ct_Chng_Q4_Q1")]
validation_data <- validation_data[,c("Attrition_Flag", "Gender", "Dependent_count", "Marital_Status", "Income_Category", "Card_Category",
"Total_Relationship_Count", "Months_Inactive_12_mon", "Contacts_Count_12_mon", "Credit_Limit",
"Total_Revolving_Bal", "Total_Amt_Chng_Q4_Q1", "Total_Trans_Amt","Total_Trans_Ct","Total_Ct_Chng_Q4_Q1")]
fit1 <- rpart(Attrition_Flag ~ Gender + Dependent_count + Marital_Status + Income_Category + Card_Category + Total_Relationship_Count +
Months_Inactive_12_mon + Contacts_Count_12_mon + Credit_Limit + Total_Revolving_Bal +
Total_Amt_Chng_Q4_Q1 + Total_Trans_Amt + Total_Trans_Ct + Total_Ct_Chng_Q4_Q1, method="class", data = train_data)
rpart.plot(fit1, extra=1, digits=4, box.palette="auto")
predictions_rm <- predict(fit1, type = "class", newdata = validation_data)
```
# Plot helping to choose the complexity parameter for hyperparameter tuning
```{r}
plotcp(fit1) #gives cp parameters later to be chosen for hyperparameter tuning
```
# Performance indices of the reduced model:
```{r}
# Calculate accuracy
accuracy_rm <- confusionMatrix(predictions_rm, validation_data$Attrition_Flag)$overall['Accuracy']
# Calculate precision
precision_rm <- confusionMatrix(predictions_rm, validation_data$Attrition_Flag)$byClass['Precision']
# Calculate recall
recall_rm <- confusionMatrix(predictions_rm, validation_data$Attrition_Flag)$byClass['Sensitivity']
# Calculate specificity
specificity_rm <- confusionMatrix(predictions_rm, validation_data$Attrition_Flag)$byClass['Specificity']
#Calculate FPR
fpr_rm <- round(1-specificity_rm,2)*100
#Calculate FNR
fnr_rm <- round(1-recall_rm,2)*100
# Calculate F1 score
f1_score_rm <- confusionMatrix(predictions_rm, validation_data$Attrition_Flag)$byClass['F1']
# Calculate AUC-ROC
roc_rm <- roc(validation_data$Attrition_Flag, as.numeric(predictions_rm))
auc_roc_rm <- auc(roc_rm)
rm_performance <- c(Accuracy = round(accuracy_rm*100,digits=2), Precision = round(precision_rm*100,digits=2), Recall = round(recall_rm*100,digits=2),
Specificity = round(specificity_rm*100,digits=2),FPR=fpr_rm,FNR=fnr_rm, F1_Score = round(f1_score_rm*100,digits=2), AUC_ROC = round(auc_roc_rm*100,digits=2))
rm_performance
```
# k-fold cross-validation (with k = 10)
```{r}
ctrl <- trainControl(method = "cv", # Use k-fold cross-validation
number = 10) # Number of folds (e.g., 10-fold)
# Perform k-fold cross-validation
set.seed(123)
cv <- train(Attrition_Flag ~ Gender + Dependent_count + Marital_Status + Income_Category + Card_Category + Total_Relationship_Count +
Months_Inactive_12_mon + Contacts_Count_12_mon + Credit_Limit + Total_Revolving_Bal +
Total_Amt_Chng_Q4_Q1 + Total_Trans_Amt + Total_Trans_Ct + Total_Ct_Chng_Q4_Q1, data = train_data, method = "rpart", trControl = ctrl)
# View the cross-validation results
#print(cv)
predictions_kfold <- predict(cv, type = "raw", newdata = validation_data)
```
# Performance indices of the k-fold based tree:
```{r}
# Calculate accuracy
accuracy_kfold <- confusionMatrix(predictions_kfold, validation_data$Attrition_Flag)$overall['Accuracy']
# Calculate precision
precision_kfold <- confusionMatrix(predictions_kfold, validation_data$Attrition_Flag)$byClass['Precision']
# Calculate recall
recall_kfold <- confusionMatrix(predictions_kfold, validation_data$Attrition_Flag)$byClass['Sensitivity']
# Calculate specificity
specificity_kfold <- confusionMatrix(predictions_kfold, validation_data$Attrition_Flag)$byClass['Specificity']
#Calculate FPR
fpr_kfold <- round(1-specificity_kfold,2)*100
#Calculate FNR
fnr_kfold <- round(1-recall_kfold,2)*100
# Calculate F1 score
f1_score_kfold <- confusionMatrix(predictions_kfold, validation_data$Attrition_Flag)$byClass['F1']
# Calculate AUC-ROC
roc_kfold <- roc(validation_data$Attrition_Flag, as.numeric(predictions_kfold))
auc_roc_kfold <- auc(roc_kfold)
kfold_performance <- c(Accuracy = round(accuracy_kfold*100,digits=2), Precision = round(precision_kfold*100,digits=2), Recall = round(recall_kfold*100,digits=2),
Specificity = round(specificity_kfold*100,digits=2),FPR=fpr_kfold,FNR=fnr_kfold, F1_Score = round(f1_score_kfold*100,digits=2), AUC_ROC = round(auc_roc_kfold*100,digits=2))
kfold_performance
```
# Overall considerations at this stage:
The full tree generally performs slightly better than the reduced-model tree on most metrics, indicating that the additional variables in the full tree contribute to improved performance. Both the full tree and the reduced-model tree outperform the k-fold tree across most metrics, suggesting that the k-fold tree has slightly reduced predictive power or stability. We should weigh the trade-offs between model complexity, interpretability, and performance when deciding between the full tree and the reduced-model tree, and evaluate whether the observed differences in metrics are practically significant and aligned with the business goals. To go further, we continue the iterative process with steps such as hyperparameter tuning or exploring alternative algorithms. So, next we tune a tree on the reduced model with hyperparameters.
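As a sketch of that tuning step, the complexity parameter could also be chosen by a grid search with `caret` instead of being fixed by hand (the chunk is not evaluated; the grid values are illustrative, in the range suggested by `plotcp()`):

```{r, eval=FALSE}
# Illustrative sketch only: 10-fold CV grid search over cp via caret.
# The cp values below are assumptions, not from the original analysis.
cp_grid <- expand.grid(cp = c(0.001, 0.005, 0.011, 0.02, 0.05))
set.seed(123)
cv_tuned <- train(Attrition_Flag ~ ., data = train_data,
                  method = "rpart",
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = cp_grid)
cv_tuned$bestTune  # cp with the highest cross-validated accuracy
```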
# Hyperparameter tuning with the complexity parameter set to cp = 0.011, and the resulting plot
```{r}
ctrl <- rpart.control(minsplit = 4,
minbucket = round(5 / 3),
maxdepth = 3,
cp = 0.011)
# Perform tuning model
fit_tune <- rpart(Attrition_Flag ~ Gender + Dependent_count + Marital_Status + Income_Category + Card_Category + Total_Relationship_Count +
Months_Inactive_12_mon + Contacts_Count_12_mon + Credit_Limit + Total_Revolving_Bal +
Total_Amt_Chng_Q4_Q1 + Total_Trans_Amt + Total_Trans_Ct + Total_Ct_Chng_Q4_Q1, method = "class", data = train_data, control = ctrl)
# Evaluate performance
#print(fit_tune)
rpart.plot(fit_tune, extra=1, digits=4, box.palette="auto")
#print(fit_tune)
predictions_tune <- predict(fit_tune, newdata = validation_data, type = "class")
```
# Performance indices of the tuned model:
```{r}
# Calculate accuracy
accuracy_tune <- confusionMatrix(predictions_tune, validation_data$Attrition_Flag)$overall['Accuracy']
# Calculate precision
precision_tune <- confusionMatrix(predictions_tune, validation_data$Attrition_Flag)$byClass['Precision']
# Calculate recall
recall_tune <- confusionMatrix(predictions_tune, validation_data$Attrition_Flag)$byClass['Sensitivity']
# Calculate specificity
specificity_tune <- confusionMatrix(predictions_tune, validation_data$Attrition_Flag)$byClass['Specificity']
fpr_tune <- 1-specificity_tune
fnr_tune <- 1-recall_tune
# Calculate F1 score
f1_score_tune <- confusionMatrix(predictions_tune, validation_data$Attrition_Flag)$byClass['F1']
# Calculate AUC-ROC
roc_tune <- roc(validation_data$Attrition_Flag, as.numeric(predictions_tune))
auc_roc_tune <- auc(roc_tune)
tune_tree_performance <- c(Accuracy = round(accuracy_tune*100, 2), Precision = round(precision_tune*100, 2), Recall = round(recall_tune*100, 2),
Specificity = round(specificity_tune*100, 2), FPR = round(fpr_tune*100, 2), FNR = round(fnr_tune*100, 2), F1_Score = round(f1_score_tune*100,2), AUC_ROC = round(auc_roc_tune*100, 2))
tune_tree_performance
```
# Checking the balance of the response variable, which shows considerable imbalance in Attrition_Flag
```{r}
table(bank$Attrition_Flag)
```
# There is high imbalance in the data. Applying ROSE() to obtain a balanced dataset
```{r}
bank_ROSE <- bank[,c("Attrition_Flag", "Gender", "Dependent_count", "Marital_Status", "Income_Category", "Card_Category",
"Total_Relationship_Count", "Months_Inactive_12_mon", "Contacts_Count_12_mon", "Credit_Limit",
"Total_Revolving_Bal", "Total_Amt_Chng_Q4_Q1", "Total_Trans_Amt","Total_Trans_Ct","Total_Ct_Chng_Q4_Q1")]
#install.packages("ROSE")
library(ROSE)
bank_balanced<- ROSE(Attrition_Flag ~ ., data=bank_ROSE, seed = 123)$data
table(bank_balanced$Attrition_Flag)
```
# Now we can run a classification tree model on the new balanced dataset:
```{r}
ROSE_tree_fit <- rpart(Attrition_Flag ~ ., method="class", data = bank_balanced)
rpart.plot(ROSE_tree_fit, extra=1, digits=4, box.palette="auto")
predictions_ROSE <- predict(ROSE_tree_fit, type = "class", newdata= bank_balanced)
```
# Performance indices on the synthetic (ROSE) dataset (evaluated in-sample, since the tree was fit on the same balanced data):
```{r}
# Calculate accuracy
accuracy_ROSE <- confusionMatrix(predictions_ROSE, bank_balanced$Attrition_Flag)$overall['Accuracy']
# Calculate precision
precision_ROSE <- confusionMatrix(predictions_ROSE, bank_balanced$Attrition_Flag)$byClass['Precision']
# Calculate recall
recall_ROSE <- confusionMatrix(predictions_ROSE, bank_balanced$Attrition_Flag)$byClass['Sensitivity']
fnr_ROSE <- 1-recall_ROSE
# Calculate specificity
specificity_ROSE <- confusionMatrix(predictions_ROSE, bank_balanced$Attrition_Flag)$byClass['Specificity']
fpr_ROSE <- 1 - specificity_ROSE
# Calculate F1 score
f1_score_ROSE <- confusionMatrix(predictions_ROSE, bank_balanced$Attrition_Flag)$byClass['F1']
# Calculate AUC-ROC
roc_ROSE <- roc(bank_balanced$Attrition_Flag, as.numeric(predictions_ROSE))
auc_roc_ROSE <- auc(roc_ROSE)
ROSE_performance <- c(Accuracy = round(accuracy_ROSE*100, 2), Precision = round(precision_ROSE*100, 2), Recall = round(recall_ROSE*100, 2),
Specificity = round(specificity_ROSE*100, 2), FPR = round(fpr_ROSE*100,2), FNR = round(fnr_ROSE*100, 2), F1_Score = round(f1_score_ROSE*100, 2), AUC_ROC = round(auc_roc_ROSE*100, 2))
ROSE_performance
```
# Plotting of performance indices for different trees
```{r}
all_trees <- data.frame(
Method = rep(c("Dummy_Tree","Full_tree","Reduced_tree","kfold_tree","Tuned_tree", "ROSE_tree"), each = 8),
Metric = rep(c("Accuracy", "Precision","Recall", "Specificity","FPR","FNR", "F1_Score", "AUC_ROC"), times = 6),
Value = c(original_performance_dummy,original_performance, rm_performance, kfold_performance, tune_tree_performance, ROSE_performance)
)
ggplot(all_trees, aes(x = Method, y = Value, fill = Method)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~Metric, scales = "free_y") +
labs(title = "Comparison of Classification Tree Methods",
x = "Method", y = "Value") +
theme_minimal()
```
## Overall considerations after tuning the model
The tuned model shows improvements in precision, recall, and F1 score compared to the k-fold tree, indicating a more balanced and accurate model. Its specificity, however, is lower than that of the full tree and the reduced-model tree, so we should consider the implications for identifying non-churned customers in the business context. Whether to choose one of the existing models or explore a different approach, such as a Random Forest, depends on several factors:
Performance goals:
We have to assess how well each model aligns with the performance goals of the business problem. If one model consistently outperforms the others on the key metrics, it is a strong candidate.
Interpretability:
We also have to consider interpretability. Decision trees are inherently interpretable, so if interpretability is crucial, the reduced-model tree might be preferred. Random Forests, being an ensemble method, provide powerful predictive capabilities but are generally less interpretable.
Resource constraints:
Random Forests, being an ensemble method, are computationally more intensive than individual decision trees.
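To make the Random Forest option concrete, a minimal baseline on the reduced predictor set might look like the following (a sketch only, not evaluated; it assumes the `randomForest` package, which is not loaded elsewhere in this section):

```{r, eval=FALSE}
# Sketch only: baseline Random Forest on the reduced predictor set.
# Assumes the randomForest package is installed.
library(randomForest)
set.seed(123)
rf_sketch <- randomForest(Attrition_Flag ~ ., data = train_data,
                          ntree = 500, importance = TRUE)
predictions_rf <- predict(rf_sketch, newdata = validation_data)
confusionMatrix(predictions_rf, validation_data$Attrition_Flag)
varImpPlot(rf_sketch)  # which predictors drive the forest
```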
## Ensemble Methods
The last class of methods we used to model the `Attrition_Flag` variable is ensemble methods; in particular, we focused on AdaBoost and Random Forest.\
The first model we built is an AdaBoost model on the whole dataset, which on its own achieved some great results. We then removed the variables considered not statistically significant in more complex models like `GAM`. This slight modification barely changed the accuracy of the model, which stays on the order of $\sim90\%$.\
The final model has been built using the following variables:
* `Gender`: the gender of the customer
* `Total_Relationship_Count`: the total number of products held by the customer
* `Months_Inactive_12_mon`: the number of months inactive in the last 12 months
* `Contacts_Count_12_mon`: the number of contacts in the last 12 months
* `Total_Revolving_Bal`: the total revolving balance on the credit card
* `Total_Trans_Amt`: the total transaction amount in the last 12 months
* `Total_Trans_Ct`: the total transaction count in the last 12 months
* `Total_Ct_Chng_Q4_Q1`: the change in transaction count from Q4 to Q1
* `Marital_Status`: the marital status of the customer
* `Income_Category`: the income category of the customer
The same pre-processing steps as in the `GLM` model were performed: slightly modifying classes, taking logarithmic values, and converting the `Months_Inactive_12_mon` variable to a factor.\
The effectiveness metrics used to assess the model are:
* Accuracy
* AUC
* FPR (False Positive Rate)
* FNR (False Negative Rate)
* Confusion matrix
* Variable importance
The dummy classifier has been taken as a baseline for comparison.
```{r ada_model_build, echo=FALSE}
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
# Load the dataset and pre-process it
bank_ensamble <- read.csv("datasets/BankChurners.csv", sep = ",")
bank_ensamble <- bank_ensamble[, -c(1, 22, 23)]
#bank$Attrition_Flag <- ifelse(bank$Attrition_Flag == "Attrited Customer", 1, 0)
# Convert Attrition_Flag to a binary factor
bank_ensamble$Attrition_Flag <- factor(bank_ensamble$Attrition_Flag == "Attrited Customer", levels = c(FALSE, TRUE))
# If "Attrited Customer" is TRUE, it will be coded as 1, and other values will be coded as 0
# Convert all categorical variables to factors and reorder the levels
bank_ensamble$Gender <- as.factor(bank_ensamble$Gender)
bank_ensamble$Income_Category <- fct_collapse(bank_ensamble$Income_Category,
"Less than 120K" = c("Unknown",
"Less than $40K",
"$40K - $60K",
"$60K - $80K",
"$80K - $120K"),
"More than 120K" = c("$120K +"))
# Changing the levels of Marital_Status in either married or not married
bank_ensamble$Marital_Status <- fct_collapse(bank_ensamble$Marital_Status,
"Married" = c("Married"),
"Not Married" = c("Divorced",
"Single",
"Unknown"))
# Converting Months_Inactive_12_mon to a factor
bank_ensamble$Months_Inactive_12_mon <- as.factor(bank_ensamble$Months_Inactive_12_mon)
levels(bank_ensamble$Months_Inactive_12_mon) <- c("0", "1", "2", "3", "4", "5", "6+")
# Joining together the levels after 4 months
bank_ensamble$Months_Inactive_12_mon <- fct_collapse(bank_ensamble$Months_Inactive_12_mon,
"4+" = c("4", "5", "6+"))
bank_ensamble$Education_Level <- as.factor(bank_ensamble$Education_Level)
bank_ensamble$Marital_Status <- as.factor(bank_ensamble$Marital_Status)
bank_ensamble$Income_Category <- as.factor(bank_ensamble$Income_Category)
bank_ensamble$Card_Category <- as.factor(bank_ensamble$Card_Category)
# Override the Total_Trans_Amt variable with its log !!!
bank_ensamble$Total_Trans_Amt <- log(bank_ensamble$Total_Trans_Amt)
bank_ensamble <- bank_ensamble[, -c(2, 4, 5, 8, 9, 13, 15, 16, 20)]
# Actual model building
set.seed(1234)
index <- createDataPartition(bank_ensamble$Attrition_Flag , p =0.8, list = FALSE)
train_bank_ensamble <- bank_ensamble[index,]
test_bank_ensamble <- bank_ensamble[-index,]
```
Learning phase:
```{r learning_boost_rf,message=FALSE,warning=FALSE}
boost_model <- learn_boost(train_bank_ensamble)
rf_model <- learn_rf(train_bank_ensamble)
```
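The `learn_boost` and `learn_rf` helpers are defined elsewhere in the report; as a rough, hypothetical sketch of what such wrappers might contain (assuming the `ada` and `randomForest` packages, with parameter values of our choosing):

```{r, eval=FALSE}
# Hypothetical sketch of the helpers used above; the actual definitions
# live elsewhere in the report and may differ.
learn_boost_sketch <- function(train_df) {
  ada::ada(Attrition_Flag ~ ., data = train_df, iter = 50)  # AdaBoost
}
learn_rf_sketch <- function(train_df) {
  randomForest::randomForest(Attrition_Flag ~ ., data = train_df, ntree = 500)
}
```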
### Results on the original dataset
A summary-like output doesn't exist for this model, but we can still assess its performance using the metrics described above.
First, we show the results for a single run of the model on the dataset:
```{r ada_model_summary,message=FALSE,warning=FALSE}
boost_results <- assess_boost(boost_model, test_bank_ensamble)
```
Then the results on a 10-fold cross validation:
```{r ada_model_cv_results,message=FALSE,warning=FALSE}
cv_boost_results <- cv_boost(bank_ensamble)
```
As we can see from these results, AdaBoost performs quite well, with a consistent accuracy of over $90\%$ and a good AUC score. The FPR and FNR are also very low, staying consistently under $20\%$. Without an AIC- or BIC-like score we can't directly compare the model to the previous ones, but its classification results are much better, in particular relative to the simple logistic regression.\
Moving on to Random Forest, we perform the same analysis:
```{r rf_model_summary,message=FALSE,warning=FALSE}
rf_results <- assess_rf(rf_model, test_bank_ensamble)
```
```{r rf_model_cv_results,message=FALSE,warning=FALSE}
cv_rf_results <- cv_rf(bank_ensamble)
```
The results are very similar to those of AdaBoost, the only difference being a slightly lower accuracy and AUC score. The FPR and FNR are also very low, staying consistently under $20\%$. It is worth noting that under cross-validation the average FNR drops below $5\%$, demonstrating the adaptive power of Random Forest even on unbalanced data.\
### Results on the synthetic dataset (ROSE)
From `bank_ensamble` let's obtain a new (synthetic) dataset: