- Predict the probability of default for each user ID (risk modeling).
- `default = 1` means the user defaulted, `default = 0` otherwise.
- Imbalanced binary classification problem.
- `uuid`: text user ID.
- `default` (the target): boolean (0 or 1).
- Categorical and numerical features are defined in `default_modeling/default_modeling/utils/preproc.pyx` (function `feature_definition`).
- To run the experiment with your own data for binary classification:
  - Replace the CSVs in both `train_data` and `test_data` with your own. (Optional: also replace the unit-test file `test_sample_1.csv` in `default_modeling/default_modeling/tests/data/`.) Each row of your CSV should correspond to a unique user ID.
  - Redefine the categorical and numerical features in `default_modeling/default_modeling/utils/preproc.pyx` (function `feature_definition`) based on your definition; a sketch follows the data example below.
  - Change `TARGET=default` in the Dockerfile to `TARGET={your target variable}`.
  - A data example can be seen below.
UUID (User id) | Feature 1 | ... | Feature N | Target (binary) |
---|---|---|---|---|
001 | 100 | ... | "AAA" | 0 |
002 | 300 | ... | "BBB" | 1 |
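The exact body of `feature_definition` is not reproduced in this README, so the following is only a minimal sketch of how such a function could declare the column groups for the example table above (the column names are hypothetical):

```python
# default_modeling/default_modeling/utils/preproc.pyx -- illustrative sketch only;
# the repo's actual feature_definition may have a different signature and contents.

def feature_definition():
    """Return the column groups consumed by the preprocessing pipeline."""
    categorical_features = ["feature_n"]   # string-valued columns, e.g. "AAA"/"BBB" above
    numerical_features = ["feature_1"]     # numeric columns, e.g. 100/300 above
    return categorical_features, numerical_features
```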
- Dependencies: pandas, numpy, category_encoders, sklearn (scikit-learn), scipy, joblib, Cython
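A `requirements.txt` matching that list might look like the one below; exact version pins are not shown in this section, and note that the pip package for sklearn is `scikit-learn`:

```
pandas
numpy
category_encoders
scikit-learn
scipy
joblib
Cython
```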
.
├── Dockerfile
├── default_modeling
│ ├── __init__.py
│ ├── default_modeling
│ │ ├── __init__.py
│ │ ├── interface
│ │ │ ├── __init__.py
│ │ │ ├── launch_predictor.py
│ │ │ ├── launch_trainer.py
│ │ │ ├── predictor.c
│ │ │ ├── predictor.pyx
│ │ │ ├── trainer.c
│ │ │ └── trainer.pyx
│ │ └── utils
│ │ ├── __init__.py
│ │ ├── load.c
│ │ ├── load.pyx
│ │ ├── preproc.c
│ │ └── preproc.pyx
│ ├── setup.py
│ ├── tests
│ │ ├── __init__.py
│ │ ├── data
│ │ │ └── test_sample_1.csv
│ │ ├── test_case_base.py
│ │ └── test_data_handling.py
├── model
│ └── risk_model.joblib
├── prototype
│ ├── prototype_cython.ipynb
│ └── prototype_python.ipynb
├── requirements.txt
├── test_data
│ ├── test_set_1.csv
│ └── test_set_2.csv
└── train_data
├── train_set_1.csv
└── train_set_2.csv
!python3 -m default_modeling.setup build
!python3 -m unittest discover default_modeling
- model-dir: folder to store the trained model (`model`, as seen in this repo)
- model-name: name of the trained .joblib model (`risk_model`, saved in the `model` folder in this case)
- train-folder: folder containing the training CSVs (`train_data` in this repo)
- train-file: selected file in train-folder (`train_set_1.csv` in this case)
- target: target column of the data
- test-folder: folder containing the test CSVs (`test_data` in this repo)
- test-file: selected file in test-folder (`test_set_1.csv` in this case)
- Random forest parameters, as in sklearn's RandomForestClassifier:
  - n-estimators
  - max-depth
  - min-samples-leaf
  - random-state
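These flags presumably map one-to-one onto `sklearn.ensemble.RandomForestClassifier`; the actual trainer is compiled from `trainer.pyx`, so the snippet below is only a rough sketch of that mapping (the class-weight handling is an assumption based on the training log further down):

```python
from sklearn.ensemble import RandomForestClassifier

# Sketch only: the real model construction lives in trainer.pyx.
def build_model(n_estimators=100, max_depth=None, min_samples_leaf=1, random_state=1234):
    return RandomForestClassifier(
        n_estimators=n_estimators,          # --n-estimators
        max_depth=max_depth,                # --max-depth
        min_samples_leaf=min_samples_leaf,  # --min-samples-leaf
        random_state=random_state,          # --random-state
        class_weight="balanced",            # assumption: the training log prints per-class weights
    )
```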
!python3 -m default_modeling.default_modeling.interface.launch_trainer \
--model-dir ./model \
--model-name risk_model \
--train-folder train_data \
--train-file train_set_1.csv \
--target default
!python3 -m default_modeling.default_modeling.interface.launch_trainer \
--model-dir ./model \
--model-name risk_model \
--train-folder train_data \
--train-file train_set_1.csv \
--target default \
--n-estimators 200 \
--max-depth 15 \
--min-samples-leaf 20
!python3 -m default_modeling.default_modeling.interface.launch_predictor \
--model-dir ./model \
--model-name risk_model \
--test-folder test_data \
--test-file test_set_1.csv \
--target default
- My local working directory is named `/home/jupyter`. In this local working directory:
  - the `train_data` folder contains different files for training random forest classifiers,
  - the `model` folder stores the trained `.joblib` random forest, and the model is loaded from this folder for prediction,
  - the `test_data` folder contains new data waiting for prediction; the prediction result is stored locally inside the same file in this folder.
- The container mounts those local folders: `train_data`, `test_data`, and `model`.
- With this approach, we can conveniently work with any new incoming data simply by replacing the files inside `train_data` and/or `test_data`.
- The container is built both in pure Python and in Cython.
FROM python:3.8
WORKDIR /app/
RUN mkdir model
ENV TRAIN_FOLDER=./train_data
ENV TEST_FOLDER=./test_data
ENV TRAIN_FILE=train_set.csv
ENV TEST_FILE=test_set.csv
ENV MODEL_DIR=./model
ENV MODEL_NAME=risk_model
ENV TARGET=default
COPY requirements.txt .
COPY default_modeling default_modeling
RUN pip install -r requirements.txt
RUN python3 -m default_modeling.setup build
ENTRYPOINT ["python3"]
!docker build -t default_model -f Dockerfile .
!docker run -t default_model:latest -m unittest discover default_modeling
Found the following test data
default_modeling/tests/data/test_sample_1.csv
..
----------------------------------------------------------------------
Ran 2 tests in 0.772s
OK
Train with the selected file, e.g. `train_data/train_set_1.csv`. If no hyperparameters are declared (n_estimators, max_depth, ...), training uses the default hyperparameters. Remember to mount the local `train_data` and `model` folders:
!docker run -v /home/jupyter/train_data:/app/train_data \
-v /home/jupyter/model:/app/model \
default_model:latest -m default_modeling.default_modeling.interface.launch_trainer \
--train-file train_set_1.csv \
--n-estimators 200 \
--max-depth 15 \
--min-samples-leaf 20
extracting arguments
Namespace(max_depth=15, min_samples_leaf=20, model_dir='./model', model_name='risk_model', n_estimators=200, random_state=1234, target='default', train_file='train_set_1.csv', train_folder='./train_data')
Training Data at ./train_data/train_set_1.csv
('Total Input Features', 39)
('class weight', {0: 0.5074062934696794, 1: 34.255076142131976})
Found existing model at: ./model/risk_model.joblib.
Overwriting ...
Congratulation! Saving model at ./model/risk_model.joblib. Finish after 3.684312582015991 s
!docker run -v /home/jupyter/test_data:/app/test_data \
-v /home/jupyter/model:/app/model \
default_model:latest -m default_modeling.default_modeling.interface.launch_predictor \
--test-file test_set_1.csv
extracting arguments
Namespace(model_dir='./model', model_name='risk_model', target='default', test_file='test_set_1.csv', test_folder='./test_data')
Found model at: ./model/risk_model.joblib
Predicting test_set_1.csv ....
Finish after 0.549715518951416 s
...to csv ./test_data/test_set_1.csv
- The decision threshold on the probability of default would likely depend on credit policy. There could be several cutoff points, or a mathematical cost function rather than a fixed decision threshold. Therefore, binary metrics like F1, recall, or precision are not meaningful in this situation, and the output should be a predicted probability.
- The KS statistic (between P(prediction | truth = 1) and P(prediction | truth = 0), quantifying the distance between the two classes) is used to evaluate the model; a sketch of this evaluation follows this list.
- Left plot: ROC AUC curve.
- Right plot: normalized KS distributions of the two types of users:
  - class 0: non-default
  - class 1: default
- With a KS score of 0.66 and a small p-value, the predictor can properly distinguish between default and non-default users (the test is significant).
- Visually, we can observe a clear gap between the two classes in the KS distribution plot.
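A minimal sketch of that evaluation, assuming a fitted model loaded from `model/risk_model.joblib` and a labeled test set whose columns already match what the model was fitted on (the repo's actual preprocessing in `preproc.pyx` is omitted here, so this is illustrative rather than the project's exact evaluation code):

```python
import joblib
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

# Load the trained model and a labeled test set (illustrative; in the repo the
# predictor handles loading and preprocessing via the compiled Cython modules).
model = joblib.load("model/risk_model.joblib")
test = pd.read_csv("test_data/test_set_1.csv")

y_true = test["default"]
proba = model.predict_proba(test.drop(columns=["default"]))[:, 1]

# ROC AUC (left plot) and two-sample KS test between the predicted-probability
# distributions of defaulted vs. non-defaulted users (right plot).
auc = roc_auc_score(y_true, proba)
ks_stat, p_value = ks_2samp(proba[y_true == 1], proba[y_true == 0])
print(f"ROC AUC = {auc:.3f}, KS = {ks_stat:.3f}, p-value = {p_value:.3g}")
```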
- In the future, host the model with an AWS SageMaker endpoint.