"Not all variables in the recipe are present" even after using `update_role_requirements` #1196

walrossker · 2023-09-06T21:03:03Z

The problem

I'd like a variable that is included in the recipe to be ignored when actually fitting a model, whether or not it is present in the data. As I understand it, that's one use of the update_role_requirements() function, but it's not working as expected.

Reproducible example

library(tidymodels)
library(forcats)

set.seed(42)

# Create full dataset that does not yet have the stratifying variable
dat <- starwars %>%
  drop_na(gender) %>%
  mutate(human = if_else(species == "Human", "human", "non-human"),
         across(c(where(is.character)), factor),
         across(mass, ~ if_else(.x > 500, NA_real_, .x))) %>%
  select(name, gender, human, height, mass)

# Split into training and testing sets stratified on a new variable
train_test_split <- dat %>%
  mutate(gender_by_human = paste0(gender, "|", human)) %>%
  initial_split(prop = 3/4, strata = gender_by_human)

# Create workflow
rec <- recipe(gender ~ ., data = training(train_test_split)) %>%
  # Change role of stratifying variable (and ID) to "other"
  update_role(c(name, gender_by_human), new_role = "other") %>%
  # Ignore the stratifying variable when baking:
  update_role_requirements(role = "other", bake = FALSE) %>%
  step_impute_knn(all_predictors()) %>%
  step_dummy(all_nominal_predictors())

spec <- logistic_reg() %>%
  set_mode("classification")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec)

# Assess performance on test set
wf %>% last_fit(train_test_split) %>% collect_metrics()
#> # A tibble: 2 × 4
#>   .metric  .estimator .estimate .config
#>   <chr>    <chr>          <dbl> <chr>
#> 1 accuracy binary         0.818 Preprocessor1_Model1
#> 2 roc_auc  binary         0.865 Preprocessor1_Model1

# Attempt to fit model on the full dataset
wf %>% fit(data = dat)
#> Error in `check_training_set()`:
#> ! Not all variables in the recipe are present in the supplied training set: 'gender_by_human'.
#> Backtrace:
#>      ▆
#>   1. ├─wf %>% fit(data = dat)
#>   2. ├─generics::fit(., data = dat)
#>   3. └─workflows:::fit.workflow(., data = dat)
#>   4.   └─workflows::.fit_pre(workflow, data)
#>   5.     ├─generics::fit(action, workflow = workflow, data = data)
#>   6.     └─workflows:::fit.action_recipe(action, workflow = workflow, data = data)
#>   7.       ├─hardhat::mold(recipe, data, blueprint = blueprint)
#>   8.       └─hardhat:::mold.recipe(recipe, data, blueprint = blueprint)
#>   9.         ├─hardhat::run_mold(blueprint, data = data)
#>  10.         └─hardhat:::run_mold.default_recipe_blueprint(blueprint, data = data)
#>  11.           └─hardhat:::mold_recipe_default_process(...)
#>  12.             ├─recipes::prep(...)
#>  13.             └─recipes:::prep.recipe(...)
#>  14.               └─recipes:::check_training_set(training, x, fresh)
#>  15.                 └─rlang::abort(...)

sessionInfo()
#> R version 4.2.3 (2023-03-15 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#>  [1] forcats_0.5.2      yardstick_1.2.0    workflowsets_1.0.1 workflows_1.1.3
#>  [5] tune_1.1.2         tidyr_1.3.0        tibble_3.2.1       rsample_1.2.0
#>  [9] recipes_1.0.8      purrr_1.0.2        parsnip_1.1.1      modeldata_1.2.0
#> [13] infer_1.0.4        ggplot2_3.4.3      dplyr_1.1.2        dials_1.2.0
#> [17] scales_1.2.1       broom_1.0.5        tidymodels_1.1.1
#>
#> loaded via a namespace (and not attached):
#>  [1] foreach_1.5.2       splines_4.2.3       R.utils_2.12.2
#>  [4] prodlim_2019.11.13  GPfit_1.0-8         yaml_2.3.7
#>  [7] globals_0.16.2      ipred_0.9-13        pillar_1.9.0
#> [10] backports_1.4.1     lattice_0.20-45     glue_1.6.2
#> [13] digest_0.6.31       hardhat_1.3.0       colorspace_2.1-0
#> [16] htmltools_0.5.4     Matrix_1.5-3        R.oo_1.25.0
#> [19] timeDate_4022.108   pkgconfig_2.0.3     lhs_1.1.6
#> [22] DiceDesign_1.9      listenv_0.9.0       gower_1.0.1
#> [25] lava_1.7.1          timechange_0.2.0    styler_1.9.1
#> [28] generics_0.1.3      ellipsis_0.3.2      withr_2.5.0
#> [31] furrr_0.3.1         nnet_7.3-18         cli_3.6.1
#> [34] survival_3.5-0      magrittr_2.0.3      evaluate_0.20
#> [37] R.methodsS3_1.8.2   fs_1.6.1            future_1.30.0
#> [40] fansi_1.0.4         parallelly_1.34.0   R.cache_0.16.0
#> [43] MASS_7.3-58.2       class_7.3-21        tools_4.2.3
#> [46] lifecycle_1.0.3     munsell_0.5.0       reprex_2.0.2
#> [49] compiler_4.2.3      rlang_1.1.1         grid_4.2.3
#> [52] iterators_1.0.14    rstudioapi_0.15.0   rmarkdown_2.20
#> [55] gtable_0.3.3        codetools_0.2-19    R6_2.5.1
#> [58] lubridate_1.9.1     knitr_1.42          fastmap_1.1.0
#> [61] future.apply_1.10.0 utf8_1.2.3          parallel_4.2.3
#> [64] Rcpp_1.0.10         vctrs_0.6.3         rpart_4.1.19
#> [67] tidyselect_1.2.0    xfun_0.36

EmilHvitfeldt · 2024-06-08T00:21:48Z

This is happening because update_role_requirements() only changes the role requirements at bake() time. When you call wf %>% fit(data = dat), the recipe first has to be prep()ed, and prep() needs every variable in the recipe, including gender_by_human, which isn't part of dat.
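
To make the prep()/bake() distinction concrete, here is a minimal sketch on a made-up data frame. The `train` data and its `y`, `x`, and `id` columns are hypothetical and are not taken from the reprex. It assumes a recipes version that provides update_role_requirements(), such as the 1.0.8 release listed in the session info above.

```r
library(recipes)

# Toy training data (hypothetical): outcome `y`, predictor `x`,
# and an `id` column that should not be used in the model.
train <- data.frame(
  y  = c(1, 2, 3, 4),
  x  = c(0.5, 1.5, 2.5, 3.5),
  id = c("a", "b", "c", "d")
)

rec <- recipe(y ~ ., data = train) %>%
  update_role(id, new_role = "other") %>%
  # Relax the requirement for the "other" role at bake() time only
  update_role_requirements(role = "other", bake = FALSE) %>%
  step_center(x)

# prep() still needs every variable the recipe was defined with,
# so a training set without `id` is rejected.
prep_error <- tryCatch(
  {
    prep(rec, training = train[, c("y", "x")])
    NULL
  },
  error = function(e) e
)
inherits(prep_error, "error")

# With `id` present at prep() time, bake() accepts new data without it.
prepped <- prep(rec, training = train)
baked <- bake(prepped, new_data = data.frame(x = c(1, 2)))
baked
```

Applied to the reprex, the same idea would be to give dat a gender_by_human column (built with the same paste0() call used before the split) before calling fit(). This is one possible workaround and is not something confirmed in this thread.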

github-actions · 2024-06-22T00:28:10Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

EmilHvitfeldt added the question label Sep 7, 2023

EmilHvitfeldt closed this as completed Jun 8, 2024

github-actions bot locked and limited conversation to collaborators Jun 22, 2024
