BREAKING CHANGE: loglik, AIC, BIC, feature-creep #1119

berndbischl · 2024-08-24T10:07:33Z

WARNING: BEFORE WE MERGE:

aic and bic should be removed from book, loglik too
check mlr3learners
mlr3tuning autotuner
add to NEWS

* update docs for a fine granularity of weight columns * add deprecation warning and rename old weight role * implemented new weights in task * added weights everywhere, docs, tests * changed defaults * improve docs * rename role weights_train to weights_learner for consistency * fix printer test * improve docs * cleanup * better soft deprecate * fix typos * next try * hard deprecate weights of learner, all workarounds failed * adapt autotest * update news * cleanup * fix autotest * fix error message * change defaults * fix test * fix: properties * refactor: add use weight parameter in base learner * chore: deprecated * tests: col roles * fix: use weights * test: use weights * fix --------- Co-authored-by: be-marc <marcbecker@posteo.de>

mllg · 2024-08-27T08:58:25Z

Also needs to be removed from mlr3learners for log reg, multinom and linear regression. Or could just stay in the package with improved documentation.

mllg · 2024-08-29T16:31:32Z

TBH, I'd really like to keep AIC/BIC in the packages. I show this during teaching together with a SFS and compare it to step().

mllg · 2024-08-29T16:34:47Z

After a closer look, we could remove the properties and methods altogether but keep the measures for AIC and BIC. The user then has to ensure that stats::logLik() is properly working on the fitted model. This is a S3 generic and can be extended. We would only lose some pre-checks and improved error messages.

sebffischer · 2024-08-30T07:25:50Z

NEWS.md

+  The weights used during training by the Learner are renamed to `weights_learner`, the previous column role `weight` is dysfunctional.
+  Additionally, it is now possible to disable the use of weights via the new hyperparameter `use_weights`.
+  Note that this is a breaking change, but appears to be the less error-prone solution in the long run.
+* refactor: Deprecated `data_format` and `data_formats` for Learners, Tasks, and DataBackends.


Just because I see it here, I don't think "Deprecation" is a refactor?

sebffischer · 2024-08-30T07:26:55Z

NEWS.md

@@ -1,24 +1,29 @@
 # mlr3 (development version)

-* Deprecated `data_format` and `data_formats` for Learners, Tasks, and DataBackends.
+* refactor: It is now possible to use weights also during scoring predictions via measures and during resampling to sample observations with unequal probability.


This is a BREAKING_CHANGE and not a refactor I think

sebffischer · 2024-08-30T07:29:28Z

NEWS.md

+  The weights used during training by the Learner are renamed to `weights_learner`, the previous column role `weight` is dysfunctional.
+  Additionally, it is now possible to disable the use of weights via the new hyperparameter `use_weights`.
+  Note that this is a breaking change, but appears to be the less error-prone solution in the long run.
+* refactor: Deprecated `data_format` and `data_formats` for Learners, Tasks, and DataBackends.
 * feat: The `partition()` function creates training, test and validation sets.
 * refactor: Optimize runtime of fixing factor levels.


I think this should be "perf" and not "refactor".

This is my template:

# * feat A new feature - SemVar PATCH # * fix A bug fix - SemVar MINOR # * BREAKING CHANGE Breaking API change - SemVar MAJOR # * docs Change to documentation only # * style Change to style (whitespace, etc.) # * refactor Change not related to a bug or feat # * perf Change that affects performance # * test Change that adds/modifies tests # * build Change to build system # * ci Change to CI pipeline/workflow # * chore General tooling/config/min refactor

R/Learner.R

sebffischer · 2024-08-30T07:44:06Z

R/Measure.R

+#' @section Weights:
+#'
+#' Many measures support observation weights, indicated by their property `"weights"`.
+#' The weights are stored in the [Task] where the column role `weights_measure` needs to be assigned to a single numeric column.


Suggested change

#' The weights are stored in the [Task] where the column role `weights_measure` needs to be assigned to a single numeric column.

#' The weights are stored in the [Task] where the column role `weights_measure` needs to be assigned to a single positive numeric column.

sebffischer · 2024-08-30T08:14:59Z

R/Task.R

    #'
    #' `col_roles` is a named list whose elements are named by column role and each element is a `character()` vector of column names.
    #' To alter the roles, just modify the list, e.g. with \R's set functions ([intersect()], [setdiff()], [union()], \ldots).
    #' The method `$set_col_roles` provides a convenient alternative to assign columns to roles.
+    #'
+    #' The roles `weights_learner`, `weights_measure` and `weights_resampling` may only point to a single numeric column, but they can
+    #' all point to the same column or different columns. Weights must be non-negative numerics with at least one weight being > 0.


weights_resampling usually require more than one weight being > 0

sebffischer · 2024-08-30T08:18:40Z

R/Task.R

+      if (length(weight_cols) == 0L) {
+        return(NULL)
+      }
+      data = self$backend$data(private$.row_roles$use, c(self$backend$primary_key, weight_cols))


This might fail because columns can be returned in arbitrary order:

mlr3/R/DataBackendDataTable.R

Line 61 in 57b6109

#' Rows are guaranteed to be returned in the same order as `rows`, columns may be returned in an arbitrary order.

Instead you should use the $data() method of the task:

mlr3/R/Task.R

Line 265 in 57b6109

#' Rows and columns are returned in the order specified via the arguments `rows` and `cols`.

sebffischer · 2024-08-30T08:18:55Z

R/Task.R

      assert_has_backend(self)
      assert_ro_binding(rhs)
-      weight_cols = private$.col_roles$weight
+      weight_cols = private$.col_roles[["weights_measure"]]
      if (length(weight_cols) == 0L) {
        return(NULL)
      }
      data = self$backend$data(private$.row_roles$use, c(self$backend$primary_key, weight_cols))


same as above

sebffischer · 2024-08-30T08:20:45Z

R/Task.R

@@ -1189,12 +1250,23 @@ task_set_roles = function(li, elements, roles = NULL, add_to = NULL, remove_from
 }

 task_check_col_roles = function(self, new_roles) {
-  for (role in c("group", "weight", "name")) {
+  if ("weight" %in% names(new_roles)) {
+    stopf("Task role 'weight' is deprecated, use 'weights_learner' instead")


If we throw an error here I don't think "deprecated" is the right word? It is already not supported anymore?

sebffischer · 2024-08-30T08:22:55Z

inst/testthat/helper_autotest.R

+  # enable weights
+  if ("weights" %in% learner$properties) {
+    # learner must have flag "use_weights"
+    msg = checkmate::check_subset("use_weights", learner$param_set$ids(), empty.ok = FALSE)


Suggested change

msg = checkmate::check_subset("use_weights", learner$param_set$ids(), empty.ok = FALSE)

msg = checkmate::check_subset("use_weights", learner$param_set$ids())

sebffischer · 2024-08-30T09:25:43Z

One more comment about the weights_resampling. What do you think about not making this weights, but counts instead?
For the holdout resampling, e.g., for a Task with row_ids 1:10 we could set the resampling_weights to c(rep(2, 5), rep(1, 5)).
Then, when sampling the training data we would first modify the row_ids to be c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10), then sample the training data and use those IDs that were not sampled for the train set as the test set.

sebffischer · 2024-08-30T10:23:59Z

One more comment about the weights_resampling. What do you think about not making this weights, but counts instead? For the holdout resampling, e.g., for a Task with row_ids 1:10 we could set the resampling_weights to c(rep(2, 5), rep(1, 5)). Then, when sampling the training data we would first modify the row_ids to be c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10), then sample the training data and use those IDs that were not sampled for the train set as the test set.

Otoh the question is then whether we really need this because we can already set the $row_ids of the task ...

berndbischl · 2024-09-05T12:24:07Z

After a closer look, we could remove the properties and methods altogether but keep the measures for AIC and BIC. The user then has to ensure that stats::logLik() is properly working on the fitted model. This is a S3 generic and can be extended. We would only lose some pre-checks and improved error messages.

@mllg. this is fine I HOPE, IF IF IF we not implement any further stuff for this.
this would mean, AIC / BIC have the "req_model" property and they simply internally call stats:logLik.
if that does not work, the measure fails. and we doc that this happens.

be-marc · 2024-11-13T15:51:40Z

Moved to #1207

mllg and others added 3 commits August 23, 2024 20:24

fix: default fallback learner

eef7717

...

d9e9f43

berndbischl requested review from sebffischer and be-marc and removed request for sebffischer August 24, 2024 10:07

man + NEWS

6eaedce

berndbischl force-pushed the main branch from da4d632 to 27531f6 Compare August 28, 2024 19:38

be-marc changed the title ~~Removed loglik, AIC, BIC, feature-creep~~ BREAKING CHANGE: loglik, AIC, BIC, feature-creep Aug 29, 2024

sebffischer reviewed Aug 30, 2024

View reviewed changes

be-marc closed this Nov 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BREAKING CHANGE: loglik, AIC, BIC, feature-creep #1119

BREAKING CHANGE: loglik, AIC, BIC, feature-creep #1119

berndbischl commented Aug 24, 2024 •

edited

Loading

mllg commented Aug 27, 2024

mllg commented Aug 29, 2024 •

edited

Loading

mllg commented Aug 29, 2024 •

edited

Loading

sebffischer Aug 30, 2024

sebffischer Aug 30, 2024

sebffischer Aug 30, 2024

sebffischer Aug 30, 2024

sebffischer Aug 30, 2024

sebffischer Aug 30, 2024

sebffischer Aug 30, 2024

sebffischer Aug 30, 2024

sebffischer Aug 30, 2024

sebffischer commented Aug 30, 2024

sebffischer commented Aug 30, 2024

berndbischl commented Sep 5, 2024

be-marc commented Nov 13, 2024

	#' The weights are stored in the [Task] where the column role `weights_measure` needs to be assigned to a single numeric column.
	#' The weights are stored in the [Task] where the column role `weights_measure` needs to be assigned to a single positive numeric column.

	msg = checkmate::check_subset("use_weights", learner$param_set$ids(), empty.ok = FALSE)
	msg = checkmate::check_subset("use_weights", learner$param_set$ids())

BREAKING CHANGE: loglik, AIC, BIC, feature-creep #1119

BREAKING CHANGE: loglik, AIC, BIC, feature-creep #1119

Conversation

berndbischl commented Aug 24, 2024 • edited Loading

mllg commented Aug 27, 2024

mllg commented Aug 29, 2024 • edited Loading

mllg commented Aug 29, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sebffischer commented Aug 30, 2024

sebffischer commented Aug 30, 2024

berndbischl commented Sep 5, 2024

be-marc commented Nov 13, 2024

berndbischl commented Aug 24, 2024 •

edited

Loading

mllg commented Aug 29, 2024 •

edited

Loading

mllg commented Aug 29, 2024 •

edited

Loading