nvlinhvn/default-modeling

Predict the probability of default for each user ID in risk modeling


Problem Definition

  • Predict the probability of default for each user ID in risk modeling
  • default = 1 means the user defaulted; default = 0 means otherwise
  • Imbalanced binary classification problem (see the class-weight sketch below)
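
The training log further below prints balanced class weights, which is the usual way to handle this imbalance without resampling. A minimal sketch (not the repo's exact code) of how such weights are derived with scikit-learn:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical, heavily imbalanced target vector (the real target is the `default` column).
y = np.array([0] * 970 + [1] * 30)

# "balanced" weighting: n_samples / (n_classes * count(class))
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # e.g. {0: 0.515..., 1: 16.66...}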

Expected Workflow

(workflow diagram)

Variables (total = 43):

  • uuid: text user ID
  • default (the target): boolean (0 or 1)
  • Categorical and numerical features are defined in default_modeling.utils.preproc.pyx (function feature_definition); an illustrative sketch follows this list
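
The actual feature lists live in preproc.pyx; purely as an illustration of the expected shape, a feature_definition function might look like the following (the column names are placeholders, not the repo's real features):

def feature_definition():
    # Placeholder column names -- replace with the categorical and numerical
    # columns of your own dataset (see preproc.pyx for the repo's definition).
    categorical_features = ["cat_feature_1", "cat_feature_2"]
    numerical_features = ["num_feature_1", "num_feature_2"]
    return categorical_features, numerical_features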

Adjustment:

  • If you want to run the experiment with your own data for binary classification:
    • Replace the CSV files in train_data and test_data with your own. (Optional: also change the unit-test file test_sample_1.csv in default_modeling/default_modeling/tests/data/.) Each row of your CSV should correspond to a unique user ID.
    • Redefine the categorical and numerical features in default_modeling/default_modeling/utils/preproc.pyx (function feature_definition) according to your data
    • Change TARGET=default in the Dockerfile to TARGET={your target variable}
    • A data example is shown below, followed by a quick validation sketch
UUID (user ID) | Feature 1 | ... | Feature N | Target (binary)
001            | 100       | ... | "AAA"     | 0
002            | 300       | ... | "BBB"     | 1
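
Before swapping in your own CSV, a quick sanity check with pandas can confirm the assumptions above (unique user IDs, binary target). This is only a suggested check, not part of the repo:

import pandas as pd

df = pd.read_csv("train_data/train_set_1.csv")

# The workflow assumes one row per user and a strictly binary target.
assert df["uuid"].is_unique, "each row must correspond to a unique user ID"
assert set(df["default"].unique()) <= {0, 1}, "target must be binary (0/1)"
print(df.shape)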

Package Requirements:

  • pandas, numpy, category_encoders, sklearn, scipy, joblib, Cython
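
A matching requirements.txt might simply list these packages unpinned (note that sklearn is published on PyPI as scikit-learn):

pandas
numpy
category_encoders
scikit-learn
scipy
joblib
Cython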

Folder Structure

.
├── Dockerfile
├── default_modeling
│   ├── __init__.py
│   ├── default_modeling
│   │   ├── __init__.py
│   │   ├── interface
│   │   │   ├── __init__.py
│   │   │   ├── launch_predictor.py
│   │   │   ├── launch_trainer.py
│   │   │   ├── predictor.c
│   │   │   ├── predictor.pyx
│   │   │   ├── trainer.c
│   │   │   └── trainer.pyx
│   │   └── utils
│   │       ├── __init__.py
│   │       ├── load.c
│   │       ├── load.pyx 
│   │       ├── preproc.c
│   │       └── preproc.pyx
│   ├── setup.py
│   └── tests
│       ├── __init__.py
│       ├── data
│       │   └── test_sample_1.csv
│       ├── test_case_base.py
│       └── test_data_handling.py
├── model
│   └── risk_model.joblib
├── prototype
│   ├── prototype_cython.ipynb
│   └── prototype_python.ipynb
├── requirements.txt
├── test_data
│   ├── test_set_1.csv
│   └── test_set_2.csv
└── train_data
    ├── train_set_1.csv
    └── train_set_2.csv

Run Locally

After cloning this repository:

SETUP

!python3 -m default_modeling.setup build
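
The repository ships its own setup.py; purely as an illustration of what this build step does, a minimal Cython build script could look like the following (paths are inferred from the folder structure above):

from setuptools import setup
from Cython.Build import cythonize

# Compile the .pyx sources listed in the folder structure into C extensions.
setup(
    name="default_modeling",
    ext_modules=cythonize(
        [
            "default_modeling/default_modeling/interface/*.pyx",
            "default_modeling/default_modeling/utils/*.pyx",
        ],
        language_level=3,
    ),
)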

UNIT TEST

!python3 -m unittest discover default_modeling

Arguments Explanation

  • model-dir: folder to store the trained model (model in this repo)
  • model-name: name of the trained .joblib model (risk_model, saved in the model folder in this case)
  • train-folder: folder containing the training CSV (train_data in this repo)
  • train-file: selected file in train-folder (train_set_1.csv in this case)
  • target: target column in the data
  • test-folder: folder containing the test CSV (test_data in this repo)
  • test-file: selected file in test-folder (test_set_1.csv in this case)
  • Random forest parameters, as in sklearn's RandomForestClassifier (an argparse sketch follows this list):
    • n-estimators
    • max-depth
    • min-samples-leaf
    • random-state
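
These flags map one-to-one onto the argparse Namespace shown in the Docker logs further below. A trimmed-down sketch of how launch_trainer might declare them (the defaults are assumptions, apart from those visible in that Namespace output):

import argparse
import os

def parse_args():
    # Defaults fall back to the environment variables set in the Dockerfile.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", default=os.environ.get("MODEL_DIR", "./model"))
    parser.add_argument("--model-name", default=os.environ.get("MODEL_NAME", "risk_model"))
    parser.add_argument("--train-folder", default=os.environ.get("TRAIN_FOLDER", "./train_data"))
    parser.add_argument("--train-file", default=os.environ.get("TRAIN_FILE", "train_set.csv"))
    parser.add_argument("--target", default=os.environ.get("TARGET", "default"))
    # Random forest hyperparameters, passed through to RandomForestClassifier
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--max-depth", type=int, default=None)
    parser.add_argument("--min-samples-leaf", type=int, default=1)
    parser.add_argument("--random-state", type=int, default=1234)
    return parser.parse_args()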

TRAIN (train file train_data/train_set_1.csv)

!python3 -m default_modeling.default_modeling.interface.launch_trainer \
                                                --model-dir ./model \
                                                --model-name risk_model \
                                                --train-folder train_data \
                                                --train-file train_set_1.csv \
                                                --target default

Now, if we would like to tune or modify the random forest hyperparameters during training:

!python3 -m default_modeling.default_modeling.interface.launch_trainer \
                                                --model-dir ./model \
                                                --model-name risk_model \
                                                --train-folder train_data \
                                                --train-file train_set_1.csv \
                                                --target default \
                                                --n-estimators 200 \
                                                --max-depth 15 \
                                                --min-samples-leaf 20
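
Conceptually, the training step loads the CSV, encodes the categorical features, fits a class-weighted random forest, and persists it with joblib. A condensed sketch under those assumptions (the encoder choice and the artifact layout are illustrative, not the repo's exact code):

import joblib
import pandas as pd
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier

def train(train_path, target, model_path, **rf_params):
    df = pd.read_csv(train_path)
    X, y = df.drop(columns=[target, "uuid"]), df[target]

    # Encode categorical columns so the forest can consume them.
    encoder = OrdinalEncoder()
    X = encoder.fit_transform(X)

    # class_weight="balanced" compensates for the rarity of default=1.
    model = RandomForestClassifier(class_weight="balanced", **rf_params)
    model.fit(X, y)
    joblib.dump({"encoder": encoder, "model": model}, model_path)

train("train_data/train_set_1.csv", "default", "model/risk_model.joblib",
      n_estimators=200, max_depth=15, min_samples_leaf=20, random_state=1234)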

PREDICT (predict file test_data/test_set_1.csv)

!python3 -m default_modeling.default_modeling.interface.launch_predictor \
                                               --model-dir ./model \
                                               --model-name risk_model \
                                               --test-folder test_data \
                                               --test-file test_set_1.csv \
                                               --target default
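
The prediction step mirrors training: reload the joblib artifact, score predict_proba for class 1, and write the probabilities back next to the input rows. A sketch whose artifact layout and output column name follow the training sketch above (both are assumptions):

import joblib
import pandas as pd

def predict(test_path, target, model_path):
    df = pd.read_csv(test_path)
    artifact = joblib.load(model_path)
    encoder, model = artifact["encoder"], artifact["model"]

    X = encoder.transform(df.drop(columns=[target, "uuid"], errors="ignore"))
    # Output the probability of default (class 1) rather than a hard 0/1 label.
    df["prediction"] = model.predict_proba(X)[:, 1]
    df.to_csv(test_path, index=False)  # overwrite in place, as the Docker example does

predict("test_data/test_set_1.csv", "default", "model/risk_model.joblib")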

Dockerfile Contents

  • My local working directory is named /home/jupyter. In this local working directory:
    • the train_data folder contains different files for training random forest classifiers
    • the model folder stores the trained .joblib random forest, and the model is loaded from this folder for prediction
    • the test_data folder contains new data waiting for prediction; prediction results are stored locally in the same file in this folder
  • The container mounts these local folders: train_data, test_data and model
  • With this approach, we can conveniently work with any new incoming data by replacing the files inside train_data and/or test_data
  • The container is built in both pure Python and Cython
FROM python:3.8
WORKDIR /app/

RUN mkdir model

ENV TRAIN_FOLDER=./train_data
ENV TEST_FOLDER=./test_data
ENV TRAIN_FILE=train_set.csv
ENV TEST_FILE=test_set.csv
ENV MODEL_DIR=./model
ENV MODEL_NAME=risk_model
ENV TARGET=default

COPY requirements.txt .
COPY default_modeling default_modeling

RUN pip install -r requirements.txt
RUN python3 -m default_modeling.setup build

ENTRYPOINT ["python3"]

Build Image from Dockerfile

!docker build -t default_model -f Dockerfile .

Run unit tests in the image

!docker run -t default_model:latest -m unittest discover default_modeling
Found the following test data
default_modeling/tests/data/test_sample_1.csv
..
----------------------------------------------------------------------
Ran 2 tests in 0.772s

OK

Train with the selected file, e.g. train_data/train_set_1.csv. If no hyperparameters are declared (n_estimators, max_depth, ...), training falls back to the default hyperparameters. Remember to mount the local train_data and model folders.

!docker run -v /home/jupyter/train_data:/app/train_data \
            -v /home/jupyter/model:/app/model \
            default_model:latest -m default_modeling.default_modeling.interface.launch_trainer \
            --train-file train_set_1.csv \
            --n-estimators 200 \
            --max-depth 15 \
            --min-samples-leaf 20
extracting arguments
Namespace(max_depth=15, min_samples_leaf=20, model_dir='./model', model_name='risk_model', n_estimators=200, random_state=1234, target='default', train_file='train_set_1.csv', train_folder='./train_data')
Training Data at ./train_data/train_set_1.csv
('Total Input Features', 39)
('class weight', {0: 0.5074062934696794, 1: 34.255076142131976})
Found existing model at: ./model/risk_model.joblib.
Overwriting ...
Congratulation! Saving model at ./model/risk_model.joblib. Finish after 3.684312582015991 s

Then predict the selected file, e.g. test_data/test_set_1.csv. This time, mount the local test_data and model folders.

!docker run -v /home/jupyter/test_data:/app/test_data \
            -v /home/jupyter/model:/app/model \
            default_model:latest -m default_modeling.default_modeling.interface.launch_predictor \
            --test-file test_set_1.csv       
extracting arguments
Namespace(model_dir='./model', model_name='risk_model', target='default', test_file='test_set_1.csv', test_folder='./test_data')
Found model at: ./model/risk_model.joblib
Predicting test_set_1.csv ....
Finish after 0.549715518951416 s
...to csv ./test_data/test_set_1.csv

The predictions are now in the local test_data folder.

Evaluate with Metrics

  • The decision threshold on the probability of default typically depends on credit policy. There could be several cutoff points, or a cost function, rather than a single fixed threshold. Binary metrics such as F1, recall, or precision are therefore not meaningful here, and the output should be a predicted probability.
  • The KS statistic (between P(prediction | truth = 1) and P(prediction | truth = 0), quantifying the distance between the two classes) is used to evaluate the model; a small evaluation sketch follows this list.
  • Left plot: ROC AUC curve
  • Right plot: normalized KS distribution of the 2 types of users:
    • class 0: non-default
    • class 1: default
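
A minimal way to reproduce these two diagnostics with scipy and scikit-learn (the prediction column name follows the prediction sketch above and is an assumption):

import pandas as pd
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

df = pd.read_csv("test_data/test_set_1.csv")
y_true, y_score = df["default"], df["prediction"]

# KS statistic: maximum distance between the score distributions of the two classes.
ks = ks_2samp(y_score[y_true == 1], y_score[y_true == 0])
print(f"KS = {ks.statistic:.2f}, p-value = {ks.pvalue:.2e}")
print(f"ROC AUC = {roc_auc_score(y_true, y_score):.3f}")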

(ROC AUC curve and normalized KS distribution plots)

Conclusions & Future Work

  • With a KS score of 0.66 and a small p-value, the predictor can properly distinguish between default and non-default users (the test is significant)
  • Visually, we can observe a clear gap between the 2 classes in the KS distribution plot
  • In the future, host the model with an AWS SageMaker endpoint