goodfit -- Takes the predicted results from a binary outcome model and displays goodness of fit measures.
goodfit [true_y] [y_pred] [if] [, cutoff(integer) max_cutoff n_quart(integer) mcc_graph roc_graph pr_graph]
This program is intended for use with any binary outcome model, including but not limited to probit, logit, logistic, and lasso. It takes the predicted outcome and provides a summary table of goodness-of-fit measures. The program is inspired by estat classification, but it is not limited by model choice and it approximates the optimal positive cutoff threshold using the Matthews Correlation Coefficient (MCC). In machine learning for binary classification, MCC is often regarded as the preferred single metric, especially for imbalanced data (Chicco & Jurman 2020; Boughorbel et al. 2017). The metric ranges from -1 (total disagreement) to +1 (perfect prediction), with 0 indicating performance no better than random, and is computed as:
MCC = (TP×TN-FP×FN) / sqrt((TP+FP)×(TP+FN)×(TN+FP)×(TN+FN))
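As a way to check the formula by hand outside Stata, the following is a minimal Python sketch, not goodfit's own code. The function name mcc and the convention of returning 0 when the denominator is zero are assumptions made for illustration.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when any marginal total is zero,
    # since the ratio is undefined in that case.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Example counts (made up for illustration): 40 TP, 50 TN, 5 FP, 5 FN.
print(mcc(tp=40, tn=50, fp=5, fn=5))
```

A perfect classifier (FP = FN = 0) gives 1, and a classifier that gets every case wrong (TP = TN = 0) gives -1.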
If another metric is preferred, use the cutoff option and the returned results to evaluate that measure.
true_y the name of the original outcome variable.
y_pred the name of the predicted outcome variable.
cutoff the positive cutoff threshold used when max_cutoff is not specified. The default is 0.5.
max_cutoff approximates the optimal positive cutoff threshold with a grid search that uses quantiles of the predicted outcome as evaluation points. The default number of quantiles is 50.
n_quart sets the number of quantiles used by max_cutoff, overriding the default of 50.
mcc_graph Graphs several goodness-of-fit measures, including MCC, over the range of potential cutoff points for the predicted outcome.
roc_graph Graphs the receiver operating characteristic (ROC) curve, which plots the true positive rate on the y-axis against the false positive rate on the x-axis. It also calculates the area under the curve to help with model comparison.
pr_graph Graphs the precision-recall curve (PRC), which is considered more informative than the ROC curve for imbalanced data (Saito & Rehmsmeier 2015). It also calculates the area under the curve to help with model comparison.
Note: If cutoff is not used, then max_cutoff is required.
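The grid search described for max_cutoff can be sketched in Python as follows. This is an illustration of the general idea only, not goodfit's implementation: the helper names, the "at or above the cutoff" classification rule, the inclusive quantile method, and the tie-breaking toward the larger cutoff are all assumptions.

```python
import math
from statistics import quantiles


def mcc_at_cutoff(y_true, y_pred, cutoff):
    """MCC when predictions at or above `cutoff` are classified as positive."""
    tp = tn = fp = fn = 0
    for y, p in zip(y_true, y_pred):
        predicted_positive = p >= cutoff  # assumed rule; goodfit may use a strict ">"
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_cutoff(y_true, y_pred, n_points=50):
    """Return (max MCC, cutoff) over `n_points` quantiles of the predictions."""
    # quantiles(..., n=k) yields k - 1 cut points, so ask for n_points + 1 groups.
    candidates = quantiles(y_pred, n=n_points + 1, method="inclusive")
    # Ties in MCC are broken toward the larger cutoff (an arbitrary choice).
    return max((mcc_at_cutoff(y_true, y_pred, c), c) for c in candidates)


# Toy data (made up): the two classes are perfectly separated between 0.4 and 0.6.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
best_mcc, cutoff = best_cutoff(y_true, y_pred)
print(best_mcc, cutoff)
```

Because only a finite set of quantiles is evaluated, the result approximates the optimal cutoff; increasing the number of evaluation points refines the search at the cost of more computation.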
goodfit stores the following in r():
r(MCC) estimated max MCC value
r(p_correct) percent correctly classified
r(f_cutoff) final cutoff value
r(p_neg_pred) negative predictive value
r(p_pos_pred) positive predictive value
r(p_t_pos_rate) true positive rate
r(p_t_neg_rate) true negative rate
r(p_f_pos_rate) false positive rate
r(p_f_neg_rate) false negative rate
e(Gph_results) Contains the results for each quantile evaluation point.
r(y_pred_str) Contains the name of the predicted outcome variable.
r(y_outcome_str) Contains the name of the true outcome variable.
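As one example of testing another measure from the stored results, the F1 score is the harmonic mean of precision (the positive predictive value) and recall (the true positive rate). The Python sketch below is illustrative only: the function name is hypothetical, and it assumes both inputs are proportions between 0 and 1. If the stored values are percentages, divide them by 100 first.

```python
def f1_from_rates(ppv: float, tpr: float) -> float:
    """F1 score from precision (PPV) and recall (TPR), both on a 0-1 scale."""
    # Assumed convention: report 0 when both inputs are zero.
    if ppv + tpr == 0:
        return 0.0
    return 2 * ppv * tpr / (ppv + tpr)


# Made-up values standing in for r(p_pos_pred) and r(p_t_pos_rate).
print(f1_from_rates(ppv=0.8, tpr=0.6))
```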
If you have any issues or suggestions regarding the program, please report them using the following steps:
If you have a GitHub account:
- Go to the goodfit GitHub repository issue page: https://github.com/jphenson/goodfit/issues
- Click the green button labeled "New issue"
- Submit your issue or suggestion.
If you do not have a GitHub account, please email me at jphenson1218@gmail.com.
James Patrick Henson
Georgia State University
Federal Reserve Bank of Atlanta
Atlanta, GA USA
jphenson1218@gmail.com
Boughorbel S, Jarray F, El-Anbari M. 2017. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE. 12(6):e0177678.
Chicco D, Jurman G. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 21(1):6.
Saito T, Rehmsmeier M. 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. 10(3):e0118432.