NOTE: In the following tables, 0 represent low credit risk and 1 represents high credit risk.
The aim of this project is to determine which classification technique produces the best results when applied to the task of determining credit riskiness.
The following classification models were applied to the data and the statistics of their performance documented:
-
Resampling Techniques
i. Oversampling - Naïve Random Overampling
ii. Oversampling - SMOTE Oversampling
iii. Undersampling
iv. Combination Over- and Undersampling -
Ensemble Learners
i. Balance Random Forest Classifier
ii. Easy Ensemble AdaBoost Classifier
-
Balanced Accuracy score (validation): 0.835
-
Confusion matrix:
-- -- |14502 2602| | 18 83| -- --
- Imbalanced Classification Report:
pre | rec | spe | f1 | geo | iba | sup | |
---|---|---|---|---|---|---|---|
0 | 1.00 | 0.85 | 0.82 | 0.92 | 0.83 | 0.70 | 17104 |
1 | 0.03 | 0.82 | 0.85 | 0.06 | 0.83 | 0.69 | 101 |
avg / total | 0.99 | 0.85 | 0.82 | 0.91 | 0.83 | 0.70 | 17205 |
-
Balanced Accuracy score (validation)*: 0.841
-
Confusion matrix:
-- -- |14883 2221| | 19 82| -- --
- Imbalanced Classification Report:
pre | rec | spe | f1 | geo | iba | sup | |
---|---|---|---|---|---|---|---|
0 | 1.00 | 0.87 | 0.81 | 0.93 | 0.84 | 0.71 | 17104 |
1 | 0.04 | 0.81 | 0.87 | 0.07 | 0.84 | 0.70 | 101 |
avg / total | 0.99 | 0.87 | 0.81 | 0.92 | 0.84 | 0.71 | 17205 |
-
Balanced Accuracy score (validation): 0.824
-
Confusion matrix:
-- -- |13121 3983| | 12 89| -- --
- Imbalanced Classification Report:
pre | rec | spe | f1 | geo | iba | sup | |
---|---|---|---|---|---|---|---|
0 | 1.00 | 0.77 | 0.88 | 0.87 | 0.82 | 0.67 | 17104 |
1 | 0.02 | 0.88 | 0.77 | 0.04 | 0.82 | 0.68 | 101 |
avg / total | 0.99 | 0.77 | 0.88 | 0.86 | 0.82 | 0.67 | 17205 |
-
Balanced Accuracy score (validation): 0.841
-
Confusion matrix:
-- -- |14723 2381| | 18 83| -- --
- Imbalanced Classification Report:
pre | rec | spe | f1 | geo | iba | sup | |
---|---|---|---|---|---|---|---|
0 | 1.00 | 0.86 | 0.82 | 0.92 | 0.84 | 0.71 | 17104 |
1 | 0.03 | 0.82 | 0.86 | 0.06 | 0.84 | 0.70 | 101 |
avg / total | 0.99 | 0.86 | 0.82 | 0.92 | 0.84 | 0.71 | 17205 |
-
Balanced Accuracy score (validation): 0.792
-
Confusion matrix:
-- -- |14212 2904| | 22 67| -- --
- Imbalanced Classification Report:
pre | rec | spe | f1 | geo | iba | sup | |
---|---|---|---|---|---|---|---|
0 | 1.00 | 0.83 | 0.75 | 0.91 | 0.79 | 0.63 | 17116 |
1 | 0.02 | 0.75 | 0.83 | 0.04 | 0.79 | 0.62 | 89 |
avg / total | 0.99 | 0.83 | 0.75 | 0.90 | 0.79 | 0.63 | 17205 |
-
Balanced Accuracy score (validation): 0.902
-
Confusion matrix:
-- -- |15506 1610| | 9 80| -- --
- Imbalanced Classification Report:
pre | rec | spe | f1 | geo | iba | sup | |
---|---|---|---|---|---|---|---|
0 | 1.00 | 0.91 | 0.90 | 0.95 | 0.90 | 0.81 | 17116 |
1 | 0.05 | 0.90 | 0.91 | 0.09 | 0.90 | 0.81 | 89 |
avg / total | 0.99 | 0.91 | 0.90 | 0.95 | 0.90 | 0.81 | 17205 |
Classification method | Accuracy |
---|---|
BalancedRandomForestClassifier | 0.792 |
Undersampling | 0.824 |
Naïve Random Oversampling | 0.835 |
SMOTE Oversampling | 0.841 |
Combination (Over and Under) Sampling | 0.841 |
EasyEnsembleClassifier | 0.902 |
The table above shows the classification methods sorted by the prediction accuracy, in ascending order. It can be observed that the worst accuracy results are achieved with undersampling which, considering that the technique involves shrinking the dataset to a very small number, is not surprising. The best accuracy is achieved by the Ensemble AdaBoost Classifier.
F1 | |
---|---|
Undersampling | 0.04 |
BalancedRandomForestClassifier | 0.04 |
Naïve Random Oversamplingsampling | 0.06 |
Combination (Over and Under) Sampling | 0.06 |
SMOTE Oversampling | 0.07 |
EasyEnsembleClassifier | 0.09 |
The table above shows that classification methods with their F1 scores. In a confirmation of the finding related to accuracy, this table also shows that the best F1 score has been achieved by the Ensemble AdaBoot Classifier.
The classification method that achieves the best results is the Ensemble AdaBoost Classifier.