To assist this institution in quantifying the risk of loan non-repayment, five supervised machine learning techniques were employed, each of which outputs a probability of default for an individual with a given set of known characteristics. An individual’s marital status, age, previous bill and payment amounts, balance limit, and education level were all considered as potential predictors for these models.
For every model, 10-fold cross-validation was employed to estimate the misclassification rate. 10-fold cross-validation was chosen over leave-one-out cross-validation (LOOCV) because it tends to give a more accurate estimate of the test error and requires less time to compute. Computational time already exceeded 15 minutes for some of the models built with 10-fold cross-validation, so any increase in computational complexity would come with a substantial tradeoff. Additionally, 10-fold cross-validation gives less variable results than a single validation set does.
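The procedure can be sketched briefly. The sketch below uses Python and scikit-learn rather than the original analysis code; the file path and the name of the response column ("default") are hypothetical placeholders, and the helper it defines is reused in the later sketches.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score

# Hypothetical path and column names standing in for the credit dataset.
credit = pd.read_csv("credit.csv")
X = credit.drop(columns="default")   # marital status, age, bill amounts, ...
y = credit["default"]                # 1 = default, 0 = no default

def cv_misclassification(model, X, y, folds=10):
    """Estimate the misclassification rate by k-fold cross-validation.

    The model is fit on k-1 folds and scored (accuracy) on the held-out
    fold, with each fold serving once as the validation set.
    """
    accuracy = cross_val_score(model, X, y, cv=folds)
    return 1 - accuracy.mean()
```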
Five methods were utilized to predict default: Logistic Regression, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Boosting, and Random Forest. Boosting gave the lowest misclassification rate, so the boosting model was developed and deployed on the testing dataset.
Logistic Regression is a Generalized Linear Model (GLM) that estimates its coefficients by maximizing a likelihood function. It was chosen for several reasons: it is easy to implement, offers the needed probabilistic interpretation, and is well suited to a response with only two levels. The misclassification rate for the logistic regression model was 0.2077657. For comparison, the trivial model, which always predicts non-default because non-default is the more common outcome, had a misclassification rate of 0.208, meaning that our model classified observations correctly only slightly more often.
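A minimal sketch of this step, reusing the cv_misclassification helper defined above; the settings shown (such as max_iter) are illustrative assumptions, not the options used in the original analysis.

```python
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(max_iter=1000)
print(cv_misclassification(logit, X, y))   # estimated misclassification rate

# Refit on the full training data; the model then returns P(default)
# directly for each individual, which is the probabilistic output we need.
logit.fit(X, y)
p_default = logit.predict_proba(X)[:, 1]
```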
LDA and QDA approximate the Bayes classifier, the classifier that minimizes the error rate but requires knowledge of the conditional distribution of the data, which is unknown in practice. Both methods model the distribution of the predictors separately within each class of the response; QDA differs from LDA in that it does not assume the classes share a common covariance matrix, as LDA does. As a result, LDA performs better when the true decision boundary is linear, while QDA generally performs better when the decision boundary is nonlinear. The misclassification rate for the LDA model was 0.2077249, which is still hardly better than the trivial model, although it is slightly better than the rate given by the logistic model. The QDA misclassification rate was 0.6253371, meaning that a completely random classifier would likely perform better than the QDA model. The extremely high QDA misclassification rate, along with the much lower LDA rate, implies that the true decision boundary is likely close to linear. Since our LDA model also outperformed the logistic model, the data within each class is likely approximately normally distributed with a covariance matrix common across classes.
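Both classifiers can be cross-validated with the same helper. This is a sketch under the earlier assumptions (the X, y, and cv_misclassification objects defined above), not the original fitting code.

```python
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# LDA pools a single covariance matrix across both classes, giving a linear
# boundary; QDA estimates one covariance matrix per class, giving a
# quadratic boundary.
print(cv_misclassification(LinearDiscriminantAnalysis(), X, y))
print(cv_misclassification(QuadraticDiscriminantAnalysis(), X, y))
```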
Tree-based methods, which assume no functional form and are non-linear, were also utilized to predict default. Random forests draw repeated bootstrap samples from the data, build a tree on each sample, and average the results, which reduces variance. The number of predictors considered at each split also affects variance: considering fewer predictors decorrelates the individual trees, further reducing the variance of the averaged model, although considering too few predictors can lead to high bias. To choose the number of predictors considered at each split, m, we built and cross-validated models under various values of m, as sketched below. The value which minimized the misclassification rate, m = 11, was chosen. The misclassification rate for the random forest model was 0.1978958.
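A sketch of that tuning loop under the same assumptions as above. In scikit-learn, m corresponds to the max_features argument; the candidate range and the n_estimators value here are illustrative choices, not those from the original analysis.

```python
from sklearn.ensemble import RandomForestClassifier

# Cross-validate over candidate values of m, the number of predictors
# considered at each split, and keep the value with the lowest rate.
rates = {}
for m in range(2, X.shape[1] + 1):
    rf = RandomForestClassifier(n_estimators=500, max_features=m,
                                random_state=1)
    rates[m] = cv_misclassification(rf, X, y)

best_m = min(rates, key=rates.get)   # m = 11 minimized the rate on our data
```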
Finally, boosting was used to build the final model. Boosted trees are grown sequentially, with information from the previously grown trees used in the construction of the next tree. The number of trees considered ranged from 50 to 500 in increments of 50. The standard shrinkage values (which control the learning rate of the model) of 0.1, 0.01, and 0.001 were all considered. The interaction depths (numbers of splits) considered were 3, 4, and 5, and the number of observations required in each node was set to 15.
Every combination of the above shrinkage values, numbers of trees, and interaction depths was used to build a model; a sketch of the grid search appears below. The model with a shrinkage value of 0.1, 500 trees, and an interaction depth of 3 produced the lowest misclassification rate of any model built, at 0.1955303. Because of this, and because boosting was tuned over more parameter combinations than any other method, this model was selected as the final model.
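The shrinkage and interaction-depth terminology suggests the analysis was run with R’s gbm package; the sketch below is only a rough scikit-learn analogue under the earlier assumptions, in which shrinkage maps to learning_rate, the number of trees to n_estimators, the node-size requirement to min_samples_leaf, and interaction depth (approximately) to max_depth.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# The tuning grid described in the text: 10 tree counts x 3 shrinkage
# values x 3 depths = 90 candidate models, each cross-validated 10-fold.
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "n_estimators": list(range(50, 501, 50)),
    "max_depth": [3, 4, 5],
}
search = GridSearchCV(
    GradientBoostingClassifier(min_samples_leaf=15),
    param_grid=grid,
    cv=10,
)
search.fit(X, y)

print(search.best_params_)     # e.g. shrinkage 0.1, 500 trees, depth 3
print(1 - search.best_score_)  # cross-validated misclassification rate
```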