This repo is a fork of https://github.com/Lion-Mod/HR-Attrition. It fixes bugs in the original repo and reproduces similar results.
https://sdv.dev/SDV/user_guides/evaluation/evaluation_framework.html
- "The output of this function call will be a number between 0 and 1 that will indicate us how similar the two tables are, being 0 the worst and 1 the best possible score."
- This claim is incorrect, even for the example given in that documentation.
- Bugs
  - The SDV parameter names for CopulaGAN had to be updated.
  - The ord_feats had to be fixed.
  - "\r" characters in the raw ipynb file crash the editor in Jupyter Notebook, so I removed all of them with a Python script.
- Methodology issues
  - He used AUC to choose his first model, which was lr.
  - For his last model, catboost had the highest AUC, but he chose gbc, which had the second-highest AUC.
  - I tried gbc with synthetic + original data and with original data only, and found that synthetic + original data gives higher scores.
- Dataset differences
  - The dataset file provided in the original repo is smaller than the Kaggle IBM one it links to.
  - Both have dimensions (1470, 35), so I think the size difference comes from how the data is compressed when stored on GitHub.
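
One way to check whether the size difference is only a storage artifact is to compare the parsed contents of the two files rather than their bytes. The following is a minimal sketch using only the standard library. The helper names (`load_rows`, `file_digest`, `table_shape`, `compare_csvs`) are hypothetical and not part of either repo, and it assumes both files are UTF-8 CSVs with a header row. Differing line endings (CRLF vs LF) are one possible cause of such a size gap, but that is an unconfirmed guess.

```python
import csv
import hashlib
from pathlib import Path


def load_rows(path):
    """Parse a CSV file into a list of rows (each row is a list of strings)."""
    # newline="" lets the csv module handle CRLF and LF line endings itself;
    # utf-8-sig drops a byte-order mark if one is present.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.reader(handle))


def file_digest(path):
    """SHA-256 of the raw bytes, used to test for byte-identical files."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def table_shape(rows):
    """(data rows excluding the header, columns), similar to a DataFrame shape."""
    if not rows:
        return (0, 0)
    return (len(rows) - 1, len(rows[0]))


def compare_csvs(path_a, path_b):
    """Report size, shape, byte equality, and parsed-content equality."""
    rows_a, rows_b = load_rows(path_a), load_rows(path_b)
    return {
        "size_bytes": (Path(path_a).stat().st_size, Path(path_b).stat().st_size),
        "shape": (table_shape(rows_a), table_shape(rows_b)),
        "same_bytes": file_digest(path_a) == file_digest(path_b),
        "same_content": rows_a == rows_b,
    }
```

Calling `compare_csvs` on the repo copy and the Kaggle copy (paths of your choosing) returns a dict. If `same_content` is true while `same_bytes` is false, the two files hold the same table and differ only in encoding details such as line endings.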

| Data | Classifier | Accuracy | AUC | Recall | Precision | F1 | Kappa | MCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | lr | 0.8794 | 0.8534 | 0.4463 | 0.7006 | 0.5388 | 0.4746 | 0.4934 |
| Original + synth | lr | 0.8971 | 0.9564 | 0.8420 | 0.9512 | 0.8562 | 0.7964 | 0.8200 |
| Original | gbc | 0.8686 | 0.8195 | 0.3140 | 0.7010 | 0.4233 | 0.3648 | 0.4056 |
| Original + synth | gbc | 0.8971 | 0.9564 | 0.8420 | 0.9512 | 0.8562 | 0.7964 | 0.8200 |

lr = logistic regression; gbc = gradient boosting classifier
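
For the "\r" cleanup listed under the bug fixes, the script itself is not shown here, so the following is only a sketch of one way it could be done. It assumes the carriage returns sit inside string values of the notebook JSON (cell sources and outputs). The names `strip_carriage_returns` and `clean_notebook` are hypothetical.

```python
import json


def strip_carriage_returns(node):
    """Recursively remove carriage-return characters from every string value."""
    if isinstance(node, str):
        return node.replace("\r", "")
    if isinstance(node, list):
        return [strip_carriage_returns(item) for item in node]
    if isinstance(node, dict):
        # Keys are left as they are; only values are cleaned.
        return {key: strip_carriage_returns(value) for key, value in node.items()}
    return node  # numbers, booleans, and None pass through unchanged


def clean_notebook(src_path, dst_path):
    """Load an .ipynb file, strip carriage returns, and write the result."""
    with open(src_path, encoding="utf-8") as src:
        notebook = json.load(src)
    cleaned = strip_carriage_returns(notebook)
    # newline="\n" keeps the output file itself free of CRLF line endings.
    with open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        json.dump(cleaned, dst, indent=1, ensure_ascii=False)
        dst.write("\n")
```

Writing to a separate `dst_path` keeps the original notebook intact, so the cleaned copy can be opened in Jupyter and checked before replacing anything.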