In this project, data preprocessing is used to clean a dataset that is peppered with problems: missing values, explicit duplicates, implicit duplicates, categorisation issues, and so on.
Each issue is handled with an appropriate approach, so that the cleaned data supports an analysis that yields conclusions about the hypothesis.
First, missing values are replaced with a median computed from other related, observed factors.
Second, explicit duplicates are dropped.
Third, implicit duplicates are resolved by converting the affected text values to lowercase.
Fourth, data types are changed to suitable ones.
Fifth, unreasonable and implausible values are replaced with reasonable ones.
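
The five cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`education`, `income_type`, `total_income`, `children`, `days_employed`), the toy rows, and the mapping used to correct implausible values are all assumptions made for the example.

```python
import pandas as pd

def preprocess(df):
    """Apply the five cleaning steps in the order described above."""
    df = df.copy()

    # 1. Fill missing income with the median of a related group
    #    (here: income type, a hypothetical choice of grouping factor).
    group_median = df.groupby("income_type")["total_income"].transform("median")
    df["total_income"] = df["total_income"].fillna(group_median)

    # 2. Drop explicit (fully identical) duplicate rows.
    df = df.drop_duplicates().reset_index(drop=True)

    # 3. Resolve implicit duplicates caused by inconsistent letter case.
    df["education"] = df["education"].str.lower()

    # 4. Convert columns to more suitable types (floats to integers).
    df["total_income"] = df["total_income"].astype(int)
    df["days_employed"] = df["days_employed"].astype(int)

    # 5. Replace implausible values; this mapping is an assumed correction.
    df["children"] = df["children"].replace({-1: 1, 20: 2})

    return df

# Toy data for illustration only; not taken from the project's dataset.
raw = pd.DataFrame({
    "education": ["Secondary", "secondary", "HIGHER", "higher", "secondary", "secondary"],
    "income_type": ["employee", "employee", "business", "business", "employee", "employee"],
    "total_income": [100.0, None, 300.0, 500.0, 200.0, None],
    "children": [1, 0, -1, 20, 2, 0],
    "days_employed": [1200.5, 340.0, 800.2, 95.0, 410.9, 340.0],
})

clean = preprocess(raw)
print(clean)
```

Because lowercasing happens after the explicit duplicates are dropped, rows that differ only by letter case can become identical in step 3; running `drop_duplicates()` once more afterwards would catch those.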
Lastly, the hypothesis is addressed. The conclusions drawn are:
- The number of children has no effect on the timeliness of repayment.
- Borrowers in a civil partnership and unmarried borrowers have a greater probability of default.
- The greater the income level, the smaller the probability of default.
- Loans for the purpose of buying a car or paying for education have a greater probability of default.
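
Comparisons like the ones above are typically made by computing the share of defaults within each category. The sketch below shows one way to do this with pandas; the column names `purpose_category` and `debt` (1 for a past default, 0 otherwise) and the toy rows are assumptions for illustration and do not reproduce the project's data or results.

```python
import pandas as pd

def default_rate_by(df, column, target="debt"):
    """Return the share of rows with target == 1 for each category, highest first."""
    return df.groupby(column)[target].mean().sort_values(ascending=False)

# Made-up rows used only to demonstrate the calculation.
loans = pd.DataFrame({
    "purpose_category": ["car", "car", "education", "real estate", "real estate", "wedding"],
    "debt": [1, 0, 1, 0, 0, 0],
})

rates = default_rate_by(loans, "purpose_category")
print(rates)
```

The same helper can be pointed at any categorical column, such as the number of children, marital status, or an income bracket, to compare default rates across groups.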