MLtoolkit is an R package providing functions to help with machine learning & feature engineering tasks.
You can install MLtoolkit from GitHub using:
# install.packages("devtools")
devtools::install_github("AndrewKostandy/MLtoolkit")
- single_mod_results(): Computes performance metrics of a single caret model object across resamples.
- mult_mod_results(): Computes performance metrics of multiple caret model objects across resamples.
- plot_mod_results(): Plots a box plot with the performance metrics of multiple caret model objects across resamples.
- pred_improve(): Gets the model performance improvement of each predictor relative to the null model.
- plot_pred_improve(): Plots the model performance improvement of each predictor relative to the null model.
- get_perc(): Gets the percentiles & the interquartile range of a dataframe's numeric columns.
- trim_df(): Trims a dataframe's numeric columns using different methods.
- get_prop(): Gets the proportion of each predictor level associated with each outcome level.
- plot_prop(): Plots the proportion of each predictor level having a specific outcome level.
- get_prop2(): Gets the proportion of each combination of two predictors' levels having each outcome level.
- plot_prop2(): Plots the proportion of each combination of two predictors' levels having a specific outcome level.
The following functions can make use of parallelization to increase computation speed if desired:
- mult_mod_results()
- plot_mod_results()
- pred_improve()
- plot_pred_improve()
- get_perc()
- trim_df()
The future package can be used as in the example below:
library(future)
plan("multiprocess") # To enable parallel processing
pred_improve(...) # This function will now make use of parallel processing
plan("sequential") # To return to sequential processing if needed
The following two sections demonstrate how to compute performance metrics for caret model objects across resamples for binary classification and regression, and how to plot the results in each case.
We'll use the BreastCancer data set from the mlbench library and make some minor modifications to it.
The metrics computed for binary classification are: Area under ROC Curve (AUROC), Sensitivity, Specificity, Area under Precision-Recall Curve (AUPRC), Precision, F1 Score, Accuracy, Cohen's Kappa, Log Loss, Matthews Correlation Coefficient, Concordance, Discordance, Somer's D, KS Statistic, and False Discovery Rate.
library(tidyverse)
library(caret)
#> Warning: replacing previous import 'ggplot2::empty' by 'plyr::empty' when
#> loading 'caret'
library(MLtoolkit)
library(mlbench)
data(BreastCancer)
dat <- BreastCancer %>%
  select(-Id) %>%
  modify_at(c(1:9), as.numeric)
# Since caret requires the positive class (malignant) to be the first factor level in the outcome:
dat <- mutate(dat, Class = fct_rev(Class))
set.seed(42)
id <- createMultiFolds(dat$Class, k = 10, times = 4)
train_ctrl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 4,
                           index = id,
                           classProbs = TRUE,
                           savePredictions = "final")
glm_fit_1 <- train(Class ~ Cl.thickness + Cell.size + Cell.shape, data = dat,
                   method = "glm",
                   trControl = train_ctrl)
glm_fit_2 <- train(Class ~ Cl.thickness + Cell.size + Cell.shape + Marg.adhesion, data = dat,
                   method = "glm",
                   trControl = train_ctrl)
glm_fit_3 <- train(Class ~ Bl.cromatin + Normal.nucleoli + Mitoses, data = dat,
                   method = "glm",
                   trControl = train_ctrl)
The single_mod_results() function works with a single caret model object and computes its performance metrics:
single_mod_results(glm_fit_1, "GLM 1") %>% head()
#> # A tibble: 6 x 17
#> Model Resample AUROC Sensitivity Specificity AUPRC Precision `F1 Score`
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 GLM 1 Fold01.… 0.996 0.958 0.978 0.949 0.958 0.958
#> 2 GLM 1 Fold01.… 1 1 1 0.96 1 1
#> 3 GLM 1 Fold01.… 0.967 0.875 1 0.931 1 0.933
#> 4 GLM 1 Fold01.… 0.982 0.958 0.978 0.879 0.958 0.958
#> 5 GLM 1 Fold02.… 0.984 0.917 0.978 0.882 0.957 0.936
#> 6 GLM 1 Fold02.… 0.983 0.958 0.956 0.925 0.92 0.939
#> # … with 9 more variables: Accuracy <dbl>, `Cohen's Kappa` <dbl>, `Log
#> # Loss` <dbl>, `Matthews Cor. Coef.` <dbl>, Concordance <dbl>,
#> # Discordance <dbl>, `Somer's D` <dbl>, `KS Statistic` <dbl>, `False
#> # Discovery Rate` <dbl>
The mult_mod_results() function works with multiple caret model objects and computes their performance metrics:
mod_results <- mult_mod_results(list(glm_fit_1, glm_fit_2, glm_fit_3), c("GLM 1", "GLM 2", "GLM 3"))
mod_results %>% head()
#> # A tibble: 6 x 17
#> Model Resample AUROC Sensitivity Specificity AUPRC Precision `F1 Score`
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 GLM 1 Fold01.… 0.996 0.958 0.978 0.949 0.958 0.958
#> 2 GLM 1 Fold01.… 1 1 1 0.96 1 1
#> 3 GLM 1 Fold01.… 0.967 0.875 1 0.931 1 0.933
#> 4 GLM 1 Fold01.… 0.982 0.958 0.978 0.879 0.958 0.958
#> 5 GLM 1 Fold02.… 0.984 0.917 0.978 0.882 0.957 0.936
#> 6 GLM 1 Fold02.… 0.983 0.958 0.956 0.925 0.92 0.939
#> # … with 9 more variables: Accuracy <dbl>, `Cohen's Kappa` <dbl>, `Log
#> # Loss` <dbl>, `Matthews Cor. Coef.` <dbl>, Concordance <dbl>,
#> # Discordance <dbl>, `Somer's D` <dbl>, `KS Statistic` <dbl>, `False
#> # Discovery Rate` <dbl>
The plot_mod_results() function produces a box plot of the models' performance metrics:
plot_mod_results(mod_results, ncol = 3)
This function can alternatively take a list of caret model objects and a list or vector of model names:
# plot_mod_results(list(glm_fit_1, glm_fit_2, glm_fit_3), c("GLM 1", "GLM 2", "GLM 3"))
We'll use the iris data set.
The performance metrics computed for regression are: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Spearman's Rho, Concordance Correlation Coefficient, and R-Squared (R2). Note that, to avoid division by zero, MAPE is not provided if any observed values equal zero.
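For reference, MAPE is commonly computed as the mean of the absolute percentage errors; a minimal sketch of that usual formula (the package's exact implementation may differ) is:
# obs and pred are hypothetical vectors of observed and predicted values
mape <- mean(abs((obs - pred) / obs)) * 100  # undefined when any obs equals zero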
set.seed(42)
id <- createMultiFolds(iris$Sepal.Length, k = 10, times = 4)
train_ctrl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 4,
                           index = id,
                           savePredictions = "final")
lm_fit_1 <- train(Sepal.Length ~ Sepal.Width, data = iris,
                  method = "lm",
                  trControl = train_ctrl)
lm_fit_2 <- train(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris,
                  method = "lm",
                  trControl = train_ctrl)
lm_fit_3 <- train(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris,
                  method = "lm",
                  trControl = train_ctrl)
The single_mod_results() function works with a single caret model object and computes its performance metrics:
single_mod_results(lm_fit_1, "LM 1") %>% head()
#> # A tibble: 6 x 8
#> Model Resample RMSE MAE MAPE `Spearman's Rho` `Concordance Co… R2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 LM 1 Fold01.… 0.940 0.789 14.2 -0.0892 -0.0335 0.0267
#> 2 LM 1 Fold01.… 1.10 0.890 14.9 -0.216 -0.103 0.251
#> 3 LM 1 Fold01.… 0.761 0.616 11.0 0.146 0.0473 0.0787
#> 4 LM 1 Fold01.… 0.771 0.644 10.6 0.389 0.0655 0.181
#> 5 LM 1 Fold02.… 0.726 0.633 11.3 0.216 0.0464 0.0460
#> 6 LM 1 Fold02.… 0.750 0.641 11.5 0.440 0.0676 0.104
The mult_mod_results() function works with multiple caret model objects and computes their performance metrics:
mod_results <- mult_mod_results(list(lm_fit_1, lm_fit_2, lm_fit_3), c("LM 1", "LM 2", "LM 3"))
mod_results %>% head()
#> # A tibble: 6 x 8
#> Model Resample RMSE MAE MAPE `Spearman's Rho` `Concordance Co… R2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 LM 1 Fold01.… 0.940 0.789 14.2 -0.0892 -0.0335 0.0267
#> 2 LM 1 Fold01.… 1.10 0.890 14.9 -0.216 -0.103 0.251
#> 3 LM 1 Fold01.… 0.761 0.616 11.0 0.146 0.0473 0.0787
#> 4 LM 1 Fold01.… 0.771 0.644 10.6 0.389 0.0655 0.181
#> 5 LM 1 Fold02.… 0.726 0.633 11.3 0.216 0.0464 0.0460
#> 6 LM 1 Fold02.… 0.750 0.641 11.5 0.440 0.0676 0.104
The plot_mod_results() function produces a box plot of the models' performance metrics. A 95% confidence interval for the mean can also be added:
plot_mod_results(mod_results, conf_int95 = TRUE)
The "InformationValue", "caret", and "MLmetrics" packages were used to compute many of the performance metrics.
We'll use the BreastCancer data set from the mlbench library that we modified above and add four uninformative predictors to it: two numeric and two categorical.
dat2 <- dat %>% mutate(Rand_Num_1 = rnorm(n = nrow(dat)),
                       Rand_Num_2 = runif(n = nrow(dat)),
                       Rand_Cat_1 = sample(x = c("Cat1A", "Cat1B", "Cat1C"),
                                           size = nrow(dat), replace = TRUE,
                                           prob = c(0.4, 0.4, 0.2)),
                       Rand_Cat_2 = sample(x = c("Cat2A", "Cat2B", "Cat2C"),
                                           size = nrow(dat), replace = TRUE,
                                           prob = c(0.1, 0.1, 0.8)))
The pred_improve() function returns the performance improvement of each predictor relative to the null model. If the outcome is categorical, then a logistic regression model is used and the area under the ROC curve is used to assess performance. If the outcome is numeric, then an ordinary least squares model is used and the root mean squared error (RMSE) is used to assess performance.
The results are estimated across resamples, and the p-value for each predictor is determined using a one-sided paired t-test comparing that predictor's results with the null model's results. The p-values are then adjusted using the Benjamini-Hochberg method to control the false discovery rate.
pred_improve(data = dat2, outcome = Class, seed = 42, folds = 10, repeats = 3)
#> # A tibble: 13 x 3
#> predictor auroc_improvement significance
#> <chr> <dbl> <dbl>
#> 1 Cell.shape 0.473 1.67e-43
#> 2 Cell.size 0.473 1.55e-39
#> 3 Bl.cromatin 0.439 3.39e-34
#> 4 Epith.c.size 0.421 2.70e-31
#> 5 Bare.nuclei 0.420 4.87e-32
#> 6 Cl.thickness 0.409 7.23e-32
#> 7 Marg.adhesion 0.396 4.64e-29
#> 8 Normal.nucleoli 0.388 2.70e-31
#> 9 Mitoses 0.210 4.21e-20
#> 10 Rand_Cat_2 0.0249 8.75e- 3
#> 11 Rand_Cat_1 0.0151 1.04e- 1
#> 12 Rand_Num_1 -0.00991 9.50e- 1
#> 13 Rand_Num_2 -0.0405 9.99e- 1
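The significance column above reflects the adjusted p-values described earlier. As a rough sketch of that testing step, using hypothetical per-resample AUROC vectors rather than the package's internals:
# Hypothetical per-resample AUROC values for a single-predictor model and the null model
auroc_pred <- c(0.91, 0.93, 0.90, 0.95, 0.92)
auroc_null <- c(0.50, 0.52, 0.49, 0.51, 0.50)
# One-sided paired t-test: is the predictor's AUROC greater than the null model's?
p_raw <- t.test(auroc_pred, auroc_null, paired = TRUE, alternative = "greater")$p.value
# The raw p-values (one per predictor) are then adjusted with the Benjamini-Hochberg method:
p_adj <- p.adjust(p_raw, method = "BH")  # pass the full vector of raw p-values in practice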
The plot_pred_improve() function can be used to return a plot directly:
plot_pred_improve(data = dat2, outcome = Class, seed = 42, folds = 10, repeats = 3)
This technique was discussed in the book Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson.
Below is a data frame whose numeric columns include univariate outliers:
training <- tibble(a = c(10, 11, 12, seq(70, 90, 2), 50, 60),
                   b = c(3, 11, 12, seq(30, 40, 1), 44, 80))
The trim_df() function will trim univariate outliers as follows:
If type = "iqr", then for each numeric variable:
-
Values below the 25th percentile by more than 1.5 x IQR are trimmed to be exactly 1.5 x IQR below the 25th percentile.
-
Values above the 75th percentile by more than 1.5 x IQR are trimmed to be exactly 1.5 x IQR above the 75th percentile.
If type = "1_99", then for each numeric variable:
-
Values below the 1st percentile are trimmed to be exactly the value of the 1st percentile.
-
Values above the 99th percentile are trimmed to be exactly the value of the 99th percentile.
training_trimmed <- trim_df(training, type = "iqr")
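The "1_99" variant is applied in the same way (the rest of this example continues with the "iqr" result above):
trim_df(training, type = "1_99")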
This is our training data before and after trimming:
Note that test data can be trimmed using the training data percentile values.
Let's make some test data:
test <- tibble(a = c(0, 11, 12, seq(70, 90, 2), 50, 100),
               b = c(25, 11, 12, seq(25, 35, 1), 100, 90))
Let's get the percentiles of our original data:
training_percentiles <- get_perc(training)
training_percentiles
#> # A tibble: 6 x 3
#> Key a b
#> <chr> <dbl> <dbl>
#> 1 Perc 1 10.1 4.20
#> 2 Perc 25 57.5 30.8
#> 3 Perc 50 75 34.5
#> 4 Perc 75 82.5 38.2
#> 5 Perc 99 89.7 74.6
#> 6 IQR 25 7.5
Let's use the percentiles of our training data to trim the test data:
test_trimmed <- trim_df(test, type = "iqr", training_percentiles)
Let's plot the test data before and after trimming using the percentiles of the original data:
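A comparison like this can be drawn with standard tidyverse tools; the sketch below shows one possible approach (it uses ordinary dplyr, tidyr, and ggplot2 functions, not MLtoolkit):
# One way to compare the test data before and after trimming (a sketch, not MLtoolkit code)
bind_rows(list(Before = test, After = test_trimmed), .id = "Stage") %>%
  pivot_longer(c(a, b), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(Stage, Value)) +
  geom_boxplot() +
  facet_wrap(~ Variable, scales = "free_y")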
This is what the diamonds data set looks like:
head(diamonds)
#> # A tibble: 6 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#> 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#> 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Let's take the "cut" column to be our outcome and the "clarity" column to be our predictor. We can get the proportion of each predictor level associated with each outcome level.
get_prop(diamonds, clarity, cut) %>% head()
#> # A tibble: 6 x 7
#> clarity cut n sum_n prop low_95ci high_95ci
#> <ord> <ord> <int> <int> <dbl> <dbl> <dbl>
#> 1 I1 Fair 210 741 0.283 0.251 0.318
#> 2 I1 Good 96 741 0.130 0.107 0.156
#> 3 I1 Very Good 84 741 0.113 0.0919 0.139
#> 4 I1 Premium 205 741 0.277 0.245 0.311
#> 5 I1 Ideal 146 741 0.197 0.169 0.228
#> 6 SI2 Fair 466 9194 0.0507 0.0463 0.0554
To plot the proportion (with its confidence interval) of each predictor level having a specific outcome level, we can use the plot_prop() function.
diamonds2 <- diamonds %>%
  mutate(cut = fct_collapse(cut,
                            `>= Premium` = c("Premium", "Ideal"),
                            `<= Very Good` = c("Fair", "Good", "Very Good")))
plot_prop(diamonds2, clarity, cut, ref_level = ">= Premium", ref_value = 0.5)
This technique was discussed in the book Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson.
Let's view a modified version of the Titanic dataset:
Titanic2 <- as.data.frame(Titanic)
# Expand the frequency table so each row represents one person, then drop the Freq column
Titanic2 <- Titanic2[rep(seq_len(nrow(Titanic2)), Titanic2$Freq), -5]
head(Titanic2)
#> Class Sex Age Survived
#> 3 3rd Male Child No
#> 3.1 3rd Male Child No
#> 3.2 3rd Male Child No
#> 3.3 3rd Male Child No
#> 3.4 3rd Male Child No
#> 3.5 3rd Male Child No
We can get the proportion of each combination of two predictors' levels having each outcome level.
get_prop2(Titanic2, Class, Sex, Survived) %>% head()
#> # A tibble: 6 x 8
#> Class Sex Survived n sum_n prop low_95ci high_95ci
#> <fct> <fct> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 1st Male No 118 180 0.656 0.581 0.724
#> 2 1st Male Yes 62 180 0.344 0.276 0.419
#> 3 1st Female No 4 145 0.0276 0.00887 0.0735
#> 4 1st Female Yes 141 145 0.972 0.926 0.991
#> 5 2nd Male No 154 179 0.860 0.799 0.906
#> 6 2nd Male Yes 25 179 0.140 0.0941 0.201
To plot the proportion (with its confidence interval) of each combination of two predictors' levels having a specific outcome level, we can use the plot_prop2() function.
plot_prop2(Titanic2, Class, Sex, Survived, ref_level = "Yes", ref_value = 0.323)