diff --git a/NEWS.md b/NEWS.md
index 16ab37e..b8d0645 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -11,7 +11,17 @@ with the topline `emfx(..., type = )` argument. (#49)
 stripping away data-heavy attributes that are unlikely to be needed
 afterwards. (#51)
 - Native `plot.emfx()` method (via a **tinyplot** backend) for visualizing
-`emfx` objects. (#54)
+`emfx` objects. (#54)
+
+## Superseded arguments
+
+- The `collapse` argument in `emfx()` is superseded by `compress`. The older
+argument is retained as an alias for backwards compatibility, but will now
+trigger a message nudging users to switch to `compress` instead. The end result
+will be identical, though. This cosmetic change was motivated by a desire to be
+more consistent with the phrasing used in the literature (i.e., on
+performance-boosting within-group compression and weighting). See Wong _et al._
+([2021](https://doi.org/10.48550/arXiv.2102.11297)), for example. (#57)
 
 ## Documentation
 
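As a reviewer aid, here is a quick end-to-end sketch of the user-facing change described above. It borrows the `mpdta` example dataset from the **did** package, as used throughout the etwfe docs; the final two calls should return identical results, with the `collapse` spelling additionally emitting the new supersession message.

```r
library(etwfe)

# Example dataset from the did package (as used in the etwfe docs)
data("mpdta", package = "did")

# First stage: estimate the ETWFE model
mod = etwfe(
  fml  = lemp ~ lpop,   # outcome ~ controls
  tvar = year,          # time variable
  gvar = first.treat,   # group (treatment cohort) variable
  data = mpdta,
  vcov = ~countyreal    # clustered standard errors
)

# Second stage: recover the ATTs
emfx(mod, compress = TRUE)  # preferred spelling going forward
emfx(mod, collapse = TRUE)  # identical result, plus the supersession message
```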
diff --git a/R/emfx.R b/R/emfx.R
index e2adc23..b54d0bf 100644
--- a/R/emfx.R
+++ b/R/emfx.R
@@ -18,15 +18,19 @@
 #' of the underlying `etwfe` model object. Users can override by setting to
 #' either `FALSE` or `TRUE.` See the "Heterogeneous treatment effects"
 #' section below.
-#' @param collapse Logical. Collapse the data by (period by cohort) groups
+#' @param compress Logical. Compress the data by (period by cohort) groups
 #' before calculating marginal effects? This trades off a slight loss in
 #' precision (typically around the 1st or 2nd significant decimal point) for a
 #' substantial improvement in estimation time for large datasets. The default
-#' behaviour (`"auto"`) is to automatically collapse if the original dataset
+#' behaviour (`"auto"`) is to automatically compress if the original dataset
 #' has more than 500,000 rows. Users can override by setting either `FALSE` or
-#' `TRUE`. Note that collapsing by group is only valid if the preceding `etwfe`
+#' `TRUE`. Note that compressing by group is only valid if the preceding `etwfe`
 #' call was run with `"ivar = NULL"` (the default). See the "Performance
 #' tips" section below.
+#' @param collapse Logical. An alias for `compress` (only used for backwards
+#' compatibility and ignored if both arguments are provided). The behaviour is
+#' identical, but supplying `collapse` will trigger a message nudging users to
+#' use the `compress` argument instead.
 #' @param predict Character. The type (scale) of prediction used to compute the
 #' marginal effects. If `"response"` (the default), then the output is at the
 #' level of the response variable, i.e. it is the expected predictor
@@ -79,6 +83,11 @@
 #' - `s.value`
 #' - `conf.low`
 #' - `conf.high`
-#' @seealso [marginaleffects::slopes] which does the heavily lifting behind the
+#' @references
+#' Wong, Jeffrey _et al._ (2021). \cite{You Only Compress Once: Optimal Data
+#' Compression for Estimating Linear Models}. Working paper (version: March 16,
+#' 2021). Available:
+#' https://doi.org/10.48550/arXiv.2102.11297
+#' @seealso [marginaleffects::slopes] which does the heavy lifting behind the
 #' scenes. [`etwfe`] is the companion estimating function that should be run
 #' before `emfx`.
@@ -92,7 +101,8 @@ emfx = function(
   object,
   type = c("simple", "group", "calendar", "event"),
   by_xvar = "auto",
-  collapse = "auto",
+  compress = "auto",
+  collapse = compress,
   predict = c("response", "link"),
   post_only = TRUE,
   lean = TRUE,
@@ -142,15 +152,22 @@ emfx = function(
 
   dat = as.data.table(eval(object$call$data, object$call_env))
   if ("group" %in% names(dat)) dat[["group"]] = NULL
 
-  # check collapse argument
+  # check compress argument
   nrows = NULL
-  if (collapse == "auto") {
+  if (compress == "auto" && collapse != compress) {
+    compress = collapse
+    message(
+      "\nPlease note that the `collapse` argument has been superseded by `compress`. ",
+      "Both arguments have an identical effect, but we encourage users to use `compress` going forward.\n"
+    )
+  }
+  if (compress == "auto") {
     nrows = nrow(dat)
     if (nrows >= 5e5) {
-      collapse = TRUE
+      compress = TRUE
     } else {
-      collapse = FALSE
+      compress = FALSE
     }
   }
 
@@ -168,7 +185,7 @@ emfx = function(
   }
 
   # define formulas and calculate weights
-  if(collapse & is.null(ivar)){
+  if(compress & is.null(ivar)){
     if(by_xvar){
       dat_weights = dat[(.Dtreat)][, .N, by = c(gvar, tvar, xvar)]
     } else {
@@ -177,19 +194,19 @@ emfx = function(
       dat_weights = dat[(.Dtreat)][, .N, by = c(gvar, tvar)]
     }
 
     if (!is.null(nrows) && nrows > 5e5) warning(
       "\nNote: Dataset larger than 500k rows detected. The data will be ",
-      "collapsed by period-cohort groups to reduce estimation times. ",
+      "compressed by period-cohort groups to reduce estimation times. ",
       "However, this shortcut can reduce the accuracy of the reported ",
       "marginal effects. ",
       "To override this default behaviour, specify: ",
-      "`emfx(..., collapse = FALSE)`\n"
+      "`emfx(..., compress = FALSE)`\n"
     )
 
-    # collapse the data
-    dat = dat[(.Dtreat)][, lapply(.SD, mean), by = c(gvar, tvar, xvar, ".Dtreat")] # collapse data
+    # compress the data
+    dat = dat[(.Dtreat)][, lapply(.SD, mean), by = c(gvar, tvar, xvar, ".Dtreat")] # compress data
     dat = setDT(dat)[, merge(.SD, dat_weights, all.x = TRUE)] # add weights
 
-  } else if (collapse & !is.null(ivar)) {
-    warning("\"ivar\" is not NULL. Marginal effects are calculated without collapsing.")
+  } else if (compress & !is.null(ivar)) {
+    warning("\"ivar\" is not NULL. Marginal effects are calculated without compressing.")
     dat$N = 1L
@@ -197,7 +214,7 @@ emfx = function(
     dat$N = 1L
   }
 
-  # collapse the data
+  # compress the data
   if (type=="simple") {
     by_var = ".Dtreat"
   } else if (type=="group") {
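Since the NEWS entry and the new references point to Wong et al. (2021), reviewers may find it useful to see the compression idea in miniature. The snippet below is an illustrative, self-contained sketch (not package code; all object names are invented): when the regressors are constant within cells, fitting on cell means with cell counts as weights reproduces the full-data point estimates exactly. The approximation that `compress = TRUE` trades away shows up in the standard errors, not in this basic equivalence.

```r
library(data.table)

set.seed(42)

# Fake data: the outcome varies within (g, t) cells, the regressors do not
dat = data.table(
  g = sample(1:4, 1e4, replace = TRUE),  # cohort-like group variable
  t = sample(1:5, 1e4, replace = TRUE)   # period variable
)
dat[, y := 0.5 * g - 0.2 * t + rnorm(.N)]

# Full-data fit
m1 = lm(y ~ factor(g) * factor(t), data = dat)

# Compressed fit: one row per (g, t) cell, weighted by the cell count
cdat = dat[, .(y = mean(y), N = .N), by = .(g, t)]
m2 = lm(y ~ factor(g) * factor(t), data = cdat, weights = N)

all.equal(coef(m1), coef(m2))  # TRUE: identical point estimates
```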
diff --git a/man/emfx.Rd b/man/emfx.Rd
index acd39e2..54620ee 100644
--- a/man/emfx.Rd
+++ b/man/emfx.Rd
@@ -8,7 +8,8 @@ emfx(
   object,
   type = c("simple", "group", "calendar", "event"),
   by_xvar = "auto",
-  collapse = "auto",
+  compress = "auto",
+  collapse = compress,
   predict = c("response", "link"),
   post_only = TRUE,
   lean = TRUE,
@@ -29,16 +30,21 @@ of the underlying \code{etwfe} model object. Users can override by setting to
 either \code{FALSE} or \code{TRUE.} See the "Heterogeneous treatment effects"
 section below.}
 
-\item{collapse}{Logical. Collapse the data by (period by cohort) groups
+\item{compress}{Logical. Compress the data by (period by cohort) groups
 before calculating marginal effects? This trades off a slight loss in
 precision (typically around the 1st or 2nd significant decimal point) for a
 substantial improvement in estimation time for large datasets. The default
-behaviour (\code{"auto"}) is to automatically collapse if the original dataset
+behaviour (\code{"auto"}) is to automatically compress if the original dataset
 has more than 500,000 rows. Users can override by setting either \code{FALSE} or
-\code{TRUE}. Note that collapsing by group is only valid if the preceding \code{etwfe}
+\code{TRUE}. Note that compressing by group is only valid if the preceding \code{etwfe}
 call was run with \code{"ivar = NULL"} (the default). See the "Performance
 tips" section below.}
 
+\item{collapse}{Logical. An alias for \code{compress} (only used for backwards
+compatibility and ignored if both arguments are provided). The behaviour is
+identical, but supplying \code{collapse} will trigger a message nudging users to
+use the \code{compress} argument instead.}
+
 \item{predict}{Character. The type (scale) of prediction used to compute the
 marginal effects. If \code{"response"} (the default), then the output is at the
 level of the response variable, i.e. it is the expected predictor
@@ -263,6 +269,12 @@ etwfe(
 
   emfx("event")
 }
+}
+\references{
+Wong, Jeffrey \emph{et al.} (2021). \cite{You Only Compress Once: Optimal Data
+Compression for Estimating Linear Models}. Working paper (version: March 16,
+2021). Available:
+https://doi.org/10.48550/arXiv.2102.11297
 }
 \seealso{
-\link[marginaleffects:slopes]{marginaleffects::slopes} which does the heavily lifting behind the
+\link[marginaleffects:slopes]{marginaleffects::slopes} which does the heavy lifting behind the
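One way to vet the "1st or 2nd significant decimal point" precision claim in the updated documentation is to compare compressed and uncompressed point estimates directly. A sketch, reusing the `mod` object from the first example above (the scale of an "acceptable" discrepancy is left to the analyst):

```r
# Point estimates with and without compression (skip SEs for speed)
est_full = emfx(mod, type = "event", vcov = FALSE)
est_comp = emfx(mod, type = "event", vcov = FALSE, compress = TRUE)

# Largest absolute discrepancy across the event-time ATT estimates
max(abs(est_full$estimate - est_comp$estimate))
```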
diff --git a/vignettes/etwfe.Rmd b/vignettes/etwfe.Rmd
index afc0d44..913cf0a 100644
--- a/vignettes/etwfe.Rmd
+++ b/vignettes/etwfe.Rmd
@@ -22,6 +22,7 @@ knitr::opts_chunk$set(
 options(width = 100)
 options(rmarkdown.html_vignette.check_title = FALSE)
 options(marginaleffects_safe = FALSE)
+modelsummary::config_modelsummary(startup_message = FALSE)
 fixest::setFixest_notes(FALSE)
 ```
 
@@ -449,27 +450,28 @@ For its part, the second `emfx()` stage also tends to be pretty performant. If
 your data has less than 100k rows, it's unlikely that you'll have to wait more
 than a few seconds to obtain results. However, `emfx`'s computation time does
 tend to scale non-linearly with the size of the original data, as well as the
-number of interactions from the underlying `etwfe` model object. Without getting
+number of interactions from the underlying `etwfe` model object.^[Without getting
 too deep into the weeds, we are relying on a numerical delta method of the
 (excellent) **marginaleffects** package underneath the hood to recover the ATTs
 of interest. This method requires estimating two prediction models for *each*
 coefficient in the model and then computing their standard errors. So it's a
 potentially expensive operation that can push the computation time for large
-datasets (> 1m rows) up to several minutes or longer.
+datasets (> 1m rows) up to several minutes or longer.]
 
 Fortunately, there are two complementary strategies that you can use to speed
 things up. The first is to turn off the most expensive part of the whole
 procedure---standard error calculation---by calling `emfx(..., vcov = FALSE)`.
-Doing so should bring the estimation time back down to a few seconds or less,
+This should bring the estimation time back down to a few seconds or less,
 even for datasets in excess of a million rows. Of course, the loss of standard
 errors might not be an acceptable trade-off for projects where statistical
-inference is critical. But the good news is this first strategy can still be
-combined our second strategy: it turns out that collapsing the data by groups
-prior to estimating the marginal effects can yield substantial speed gains on
-its own. Users can do this by invoking the `emfx(..., collapse = TRUE)`
+inference is critical. But the good news is that this first strategy can still
+be combined with a second one. Specifically, it turns out that compressing the
+data by groups prior to estimation can yield substantial speed gains on its
+own; see Wong _et al._ ([2021](https://doi.org/10.48550/arXiv.2102.11297)) on
+this. Users can do this by invoking the `emfx(..., compress = TRUE)`
 argument. While the effect here is not as dramatic as the first strategy,
-collapsing the data does have the virtue of retaining information about the
-standard errors. The trade-off this time, however, is that collapsing our data
+compressing the data does have the virtue of retaining information about the
+standard errors. The trade-off this time, however, is that compressing our data
 does lead to a loss in accuracy for our estimated parameters. On the other
 hand, testing suggests that this loss in accuracy tends to be relatively minor,
 with results equivalent up to the 1st or 2nd significant decimal place (or even
@@ -480,10 +482,10 @@ about the estimation time for large datasets and models:
 
 0. Estimate `mod = etwfe(...)` as per usual.
 1. Run `emfx(mod, vcov = FALSE, ...)`.
-2. Run `emfx(mod, vcov = FALSE, collapse = TRUE, ...)`.
-3. Compare the point estimates from steps 1 and 2. If they are are similar
+2. Run `emfx(mod, vcov = FALSE, compress = TRUE, ...)`.
+3. Compare the point estimates from steps 1 and 2. If they are similar
 enough to your satisfaction, get the approximate standard errors by running
-`emfx(mod, collapse = TRUE, ...)`.
+`emfx(mod, compress = TRUE, ...)`.
 
 It's a bit of performance art, since all of the examples in this vignette
 complete very quickly anyway. But here is a reworking of our earlier event study
@@ -496,18 +498,18 @@ example to demonstrate this performance-conscious workflow.
 emfx(mod, type = "event", vcov = FALSE)
 
 # Step 2
-emfx(mod, type = "event", vcov = FALSE, collapse = TRUE)
+emfx(mod, type = "event", vcov = FALSE, compress = TRUE)
 
 # Step 3: Results from 1 and 2 are similar enough, so get approx. SEs
-mod_es2 = emfx(mod, type = "event", collapse = TRUE)
+mod_es_compressed = emfx(mod, type = "event", compress = TRUE)
 ```
 
-To put a fine point on it, we can can compare our original event study with the
-collapsed estimates and see that the results are indeed very similar.
+To put a fine point on it, we can compare our original event study with the
+compressed estimates and see that the results are indeed very similar.
 
 ```{r}
 modelsummary(
-  list("Original" = mod_es, "Collapsed" = mod_es2),
+  list("Original" = mod_es, "Compressed" = mod_es_compressed),
   shape = term:event:statistic ~ model,
   coef_rename = rename_fn,
   gof_omit = "Adj|Within|IC|RMSE",
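Lastly, an illustrative way for reviewers to eyeball the two speed-up strategies discussed in the vignette. Timings are hardware- and data-dependent, and `mod` here is again the small example model from above, so the differences will be modest:

```r
system.time(emfx(mod, type = "event"))                   # baseline
system.time(emfx(mod, type = "event", vcov = FALSE))     # strategy 1: skip SEs
system.time(emfx(mod, type = "event", compress = TRUE))  # strategy 2: compress
```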