by Arzu Isik Topbas
I built and evaluated several machine-learning models to predict credit risk using free data from LendingClub. Credit risk is an inherently imbalanced classification problem , so I used different techniques for training and evaluating models with imbalanced classes. I used the imbalanced-learn and Scikit-learn libraries to build and evaluate models using resampling and ensembling learning techniques.
I used the imbalanced learn library to resample the LendingClub data and built and evaluated logistic regression classifiers using the resampled data.
In this section, you compared two oversampling algorithms to determine which algorithm results in the best performance. You will oversample the data using the Naive Random Oversampling algorithm and the SMOTE Oversampling algorithm.
- balanced accuracy score = 0.65
- recall score = 0.65
- geometric mean score = 0.65
- balanced accuracy score = 0.64
- recall score = 0.57
- geometric mean score = 0.63
In this section, I tested an undersampling algorithms - the Cluster Centroids algorithm - to determine which algorithm results in the best performance compared to the oversampling algorithms above.
- balanced accuracy score = 0.78
- recall score = 0.77
- geometric mean score = 0.78
In this section, I tested a combination Over- and Under-Sampling algorithm to determine if the algorithm results in the best performance compared to the other sampling algorithms above by using the SMOTEENN algorithm
- balanced accuracy score = 0.49
- recall score = 0.01
- geometric mean score = 0.08
Which model had the best balanced accuracy score?
- The Cluster Centroids algorithm has the best balanced accuracy score(0.78).
Which model had the best recall score?
- The Cluster Centroids algorithm has the best recall score(0.77).
Which model had the best geometric mean score?
- The Cluster Centroids algorithm has the best geometric mean score(0.78).
I compared two different ensemble classifiers to predict loan risk and evaluate each model. I used the Balanced Random Forest Classifier and the Easy Ensemble Classifier.
- balanced accuracy score = 0.79
- recall score = 0.91
- geometric mean score = 0.78
- the top three features = total_rec_prncp (0.073767), total_rec_int (0.063903), total_pymnt_inv (0.060733)
- balanced accuracy score = 0.93
- recall score = 0.94
- geometric mean score = 0.93
Which model had the best balanced accuracy score?
- The Easy Ensemble Classifier has the best balanced accuracy score(0.93).
Which model had the best recall score?
- The Easy Ensemble Classifier has the best recall score(0.94).
Which model had the best geometric mean score?
- The Easy Ensemble Classifier has the best geometric mean score(0.93).
What are the top three features?
- total_rec_prncp (0.073767)
- total_rec_int (0.063903)
- total_pymnt_inv (0.060733)