Machine Learning Project Credit Scorecard Model (Using Logistic Regression) on Home Credit Indonesia
Problem :
The main risk for loan companies is failure to assess credit risk accurately and efficiently. Disadvantage of Manual credit risk assessment :
- Subjectivity can introduce bias and inconsistency in decision-making
- Time-consuming especially when dealing with a large number of loan applications.
- Humans errors, such as data entry mistakes, miscalculations, or oversight of important details
Challenges :
Build a machine learning model that can automatically assess loans
- Predict client’s repayment abilities
- Speed up inspection filing without spending more money
There are 40 columns that have null values. And handling missing value by drop feature that have missing value > 50%, and replace missing values on numerical category with median & categorical with mode.
Split Data Train (80:20).
Categorical & Numerical Selection with criteria :
- Low cardinality (unique)
- No null values
- p-value < 0.05 (using chi square for categorical & ANOVA for numerical)
- Correlation coefficient <= 0.7
Before Feature Selection = 122 columns
After Feature Selection = 16 columns
Weight of Evidence (WOE) & Information Value (IV)
- WOE generally described as a measure of the separation of good and bad customers
- IV helps to rank variables on the basis of their importance.
Drop Feature No Needed :
- IV < 0.02, The variable is Not useful for prediction
- IV> 0.5, The variable is Suspicious Predictive Power
Before Feature Engineering = 16 columns
After Feature Engineering = 14 columns
- 61.23% customers that do not have payment difficulty are female, and 30.70% are Male
- In UK, Women account for 65% of the home credit industry's customers (Bermeo, 2018)
- Recommendation : Start a campaign to encourage more women to apply for credit
- 23.37% customers that do not have payment difficulty are laborers, then followed by staff and managers.
- Recommendation : Start a campaign to encourage more laborers, staff, and managers to apply for credit
- Mean AUROC of 0.7304 is generally considered good, indicating that the logistic regression model is effective at distinguishing between the positive and negative classes.
- Based on (Trifonova, 2012) An AUC - ROC 0.7–0.8 is considered good.
- Gini coefficient of 0.4608 indicates a relatively strong separation between the model's performance and random chance.
- It suggests that the logistic regression model has a good discriminatory ability.
- Based on (Teng, 2011) Gini coefficient 0.4 - 0.5 considered big gap.
- Base (Intercept) = 555
- Min Score = 300 (FICO)
- Max Score = 850 (FICO)
- Precision = Out of all the loan status that the model predicted would get good loan, only 96% actually did.
- Recall = Out of all the loan status that actually did get good loan, the model only predicted this outcome correctly for 68% ofthose loan status.
- F1 Score = 0.8. F1 score of 0.7 or higher is often considered good (spotintelligence.com, 2023) The accuracy is not really good because we've got 0.68 out of 1
62.46% got correct variable of good loan
- Best threshold = 0.353918 (Using Youden J-Statistic)
- Best threshold is used to minimized the False Positive Rate and maximize the True Positive Rate
- Precision = Out of all the loan status that the model predicted would get good loan, only 94% actually did.
- Recall = Out of all the loan status that actually did get good loan, the model only predicted this outcome correctly for 88% ofthose loan status.
- F1 Score = 0.9. F1 score of 0.7 or higher is often considered good (spotintelligence.com, 2023) The accuracy increased significantly from 0.68 to 0.84
80.44% got correct variable of good loan.
- Choosing a 0.5 threshold might mean rejecting a lot of applicants with rejection rate 34%, which could lead to losing business
- With best threshold, we've got rejection rate 14%
- So, we've decided to keep our preferred threshold = 0.353918 and Credit Score of 516
Partial Auto Reject & Auto Approve
- If a submission seems bad, it is rejected right away.
- If a submission appears to be very good, it is accepted immediately.
- If there's uncertainty, it is manually checked by the assessment team.
Create targeted campaign
- We should launch additional campaigns targeting women, laborers, staff, and managers to encourage them to apply for credit.