collapse -> compress (#57)
* collapse -> compress

* vignette

* news

* need explicit namespace
grantmcdermott authored Dec 16, 2024
1 parent 081d5d8 commit ef3a6e5
Showing 4 changed files with 73 additions and 32 deletions.
12 changes: 11 additions & 1 deletion NEWS.md
@@ -11,7 +11,17 @@ with the topline `emfx(..., type = <aggregation_type>)` argument. (#49)
stripping away data-heavy attributes that are unlikely to be needed afterwards.
(#51)
- Native `plot.emfx()` method (via a **tinyplot** backend) for visualizing
`emfx` objects. (#54)
`emfx` objects. (#54)

## Superseded arguments

- The `collapse` argument in `emfx()` is superseded by `compress`. The older
argument is retained as an alias for backwards compatibility, but will now
trigger a message, nudging users to switch to `compress` instead. The end result
will be identical, though. This cosmetic change was motivated by a desire to be
more consistent with the phrasing used in the literature (i.e., on
performance-boosting within group compression and weighting). See Wong _et al._
([2021](https://doi.org/10.48550/arXiv.2102.11297)), for example. (#57)
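
The alias behaviour described in this entry can be sketched with a small stand-in function. This is only an illustration, not the package code: `toy_emfx()` and its `n_rows` argument are hypothetical names, and the message text is shortened. It relies on R's lazy evaluation of default arguments, so `collapse` only differs from `compress` when the caller supplies it explicitly.

```r
# Hypothetical stand-in for emfx(); only the argument handling is sketched.
toy_emfx = function(n_rows, compress = "auto", collapse = compress) {
  # `collapse` defaults (lazily) to whatever `compress` is, so the two only
  # differ if the caller passed `collapse` explicitly.
  if (identical(compress, "auto") && !identical(collapse, compress)) {
    compress = collapse
    message("`collapse` has been superseded by `compress`.")
  }
  # Resolve "auto" using the 500k-row threshold that appears in the diff.
  if (identical(compress, "auto")) compress = n_rows >= 5e5
  compress
}

toy_emfx(1e3)                                     # neither argument supplied
toy_emfx(1e3, compress = TRUE)                    # preferred spelling, no message
toy_emfx(1e3, collapse = TRUE)                    # old spelling, emits the message
toy_emfx(1e3, compress = FALSE, collapse = TRUE)  # `collapse` is ignored
```

One consequence of this design is that an explicit `compress` value always wins, which matches the documented "ignored if both arguments are provided" behaviour.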

## Documentation

45 changes: 31 additions & 14 deletions R/emfx.R
@@ -18,15 +18,19 @@
#' of the underlying `etwfe` model object. Users can override by setting to
#' either `FALSE` or `TRUE`. See the "Heterogeneous treatment effects"
#' section below.
#' @param collapse Logical. Collapse the data by (period by cohort) groups
#' @param compress Logical. Compress the data by (period by cohort) groups
#' before calculating marginal effects? This trades off a slight loss in
#' precision (typically around the 1st or 2nd significant decimal point) for a
#' substantial improvement in estimation time for large datasets. The default
#' behaviour (`"auto"`) is to automatically collapse if the original dataset
#' behaviour (`"auto"`) is to automatically compress if the original dataset
#' has more than 500,000 rows. Users can override by setting either `FALSE` or
#' `TRUE`. Note that compressing by group is only valid if the preceding `etwfe`
#' call was run with `ivar = NULL` (the default). See the "Performance
#' tips" section below.
#' @param collapse Logical. An alias for `compress` (only used for backwards
#' compatibility and ignored if both arguments are provided). The behaviour is
#' identical, but it will trigger a message nudging users to use the
#' `compress` argument instead.
#' @param predict Character. The type (scale) of prediction used to compute the
#' marginal effects. If `"response"` (the default), then the output is at the
#' level of the response variable, i.e. it is the expected predictor
@@ -79,6 +83,11 @@
#' - `s.value`
#' - `conf.low`
#' - `conf.high`
#' @references
#' Wong, Jeffrey _et al._ (2021). \cite{You Only Compress Once: Optimal Data
#' Compression for Estimating Linear Models}. Working paper (version: March 16,
#' 2021). Available:
#' https://doi.org/10.48550/arXiv.2102.11297
#' @seealso [marginaleffects::slopes] which does the heavy lifting behind the
#' scenes. [`etwfe`] is the companion estimating function that should be run
#' before `emfx`.
@@ -92,7 +101,8 @@ emfx = function(
object,
type = c("simple", "group", "calendar", "event"),
by_xvar = "auto",
collapse = "auto",
compress = "auto",
collapse = compress,
predict = c("response", "link"),
post_only = TRUE,
lean = TRUE,
@@ -142,15 +152,22 @@ emfx = function(
dat = as.data.table(eval(object$call$data, object$call_env))
if ("group" %in% names(dat)) dat[["group"]] = NULL

# check collapse argument
# check compress argument
nrows = NULL
if (collapse == "auto") {
if (compress == "auto" && collapse != compress) {
compress = collapse
message(
"\nPlease note that the `collapse` argument has been superseded by `compress`. ",
"Both arguments have the identical effect, but we encourage users to use `compress` going forward.\n"
)
}
if (compress == "auto") {
nrows = nrow(dat)

if (nrows >= 5e5) {
collapse = TRUE
compress = TRUE
} else {
collapse = FALSE
compress = FALSE
}
}

@@ -168,7 +185,7 @@ emfx = function(
}

# define formulas and calculate weights
if(collapse & is.null(ivar)){
if(compress & is.null(ivar)){
if(by_xvar){
dat_weights = dat[(.Dtreat)][, .N, by = c(gvar, tvar, xvar)]
} else {
@@ -177,27 +194,27 @@ emfx = function(

if (!is.null(nrows) && nrows > 5e5) warning(
"\nNote: Dataset larger than 500k rows detected. The data will be ",
"collapsed by period-cohort groups to reduce estimation times. ",
"compressed by period-cohort groups to reduce estimation times. ",
"However, this shortcut can reduce the accuracy of the reported ",
"marginal effects. ",
"To override this default behaviour, specify: ",
"`emfx(..., collapse = FALSE)`\n"
"`emfx(..., compress = FALSE)`\n"
)

# collapse the data
dat = dat[(.Dtreat)][, lapply(.SD, mean), by = c(gvar, tvar, xvar, ".Dtreat")] # collapse data
# compress the data
dat = dat[(.Dtreat)][, lapply(.SD, mean), by = c(gvar, tvar, xvar, ".Dtreat")] # compress data
dat = setDT(dat)[, merge(.SD, dat_weights, all.x = TRUE)] # add weights


} else if (collapse & !is.null(ivar)) {
} else if (compress & !is.null(ivar)) {
warning("\"ivar\" is not NULL. Marginal effects are calculated without compressing.")
dat$N = 1L

} else {
dat$N = 1L
}

# collapse the data
# compress the data
if (type=="simple") {
by_var = ".Dtreat"
} else if (type=="group") {
18 changes: 15 additions & 3 deletions man/emfx.Rd

Some generated files are not rendered by default.

30 changes: 16 additions & 14 deletions vignettes/etwfe.Rmd
@@ -22,6 +22,7 @@ knitr::opts_chunk$set(
options(width = 100)
options(rmarkdown.html_vignette.check_title = FALSE)
options(marginaleffects_safe = FALSE)
modelsummary::config_modelsummary(startup_message = FALSE)
fixest::setFixest_notes(FALSE)
```
@@ -449,27 +450,28 @@ For its part, the second `emfx()` stage also tends to be pretty performant. If
your data has less than 100k rows, it's unlikely that you'll have to wait more
than a few seconds to obtain results. However, `emfx`'s computation time does
tend to scale non-linearly with the size of the original data, as well as the
number of interactions from the underlying `etwfe` model object. Without getting
number of interactions from the underlying `etwfe` model object.^[Without getting
too deep into the weeds, we are relying on a numerical delta method of the
(excellent) **marginaleffects** package underneath the hood to recover the ATTs
of interest. This method requires estimating two prediction models for *each*
coefficient in the model and then computing their standard errors. So it's a
potentially expensive operation that can push the computation time for large
datasets (> 1m rows) up to several minutes or longer.
datasets (> 1m rows) up to several minutes or longer.]

Fortunately, there are two complementary strategies that you can use to speed
things up. The first is to turn off the most expensive part of the whole
procedure---standard error calculation---by calling `emfx(..., vcov = FALSE)`.
Doing so should bring the estimation time back down to a few seconds or less,
This should bring the estimation time back down to a few seconds or less,
even for datasets in excess of a million rows. Of course, the loss of standard
errors might not be an acceptable trade-off for projects where statistical
inference is critical. But the good news is this first strategy can still be
combined our second strategy: it turns out that collapsing the data by groups
prior to estimating the marginal effects can yield substantial speed gains on
its own. Users can do this by invoking the `emfx(..., collapse = TRUE)`
inference is critical. But the good news is that we can combine turning off
standard errors with a second strategy. Specifically, it turns out that compressing
the data by groups prior to estimation can yield substantial speed gains on
its own; see Wong _et al._ ([2021](https://doi.org/10.48550/arXiv.2102.11297))
on this. Users can do this by invoking the `emfx(..., compress = TRUE)`
argument. While the effect here is not as dramatic as the first strategy,
compressing the data does have the virtue of retaining information about the
standard errors. The trade-off this time, however, is that collapsing our data
standard errors. The trade-off this time, however, is that compressing our data
does lead to a loss in accuracy for our estimated parameters. On the other hand,
testing suggests that this loss in accuracy tends to be relatively minor, with
results equivalent up to the 1st or 2nd significant decimal place (or even
@@ -480,10 +482,10 @@ about the estimation time for large datasets and models:

0. Estimate `mod = etwfe(...)` as per usual.
1. Run `emfx(mod, vcov = FALSE, ...)`.
2. Run `emfx(mod, vcov = FALSE, collapse = TRUE, ...)`.
2. Run `emfx(mod, vcov = FALSE, compress = TRUE, ...)`.
3. Compare the point estimates from steps 1 and 2. If they are similar
enough for your purposes, get the approximate standard errors by running
`emfx(mod, collapse = TRUE, ...)`.
`emfx(mod, compress = TRUE, ...)`.
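
The idea behind compression can be illustrated with a minimal base R sketch. This is a toy example under stated assumptions, not what `emfx()` does internally: the simulated data, the variable names (`cohort`, `period`, `y`, `N`), and the use of `lm()` are all illustrative. When every regressor is constant within a (cohort, period) cell, a weighted least squares fit on the cell means, weighted by cell size, recovers the same point estimates as the full-data fit.

```r
set.seed(42)
n = 5000

# Simulated panel-like data; regressors only vary at the (cohort, period) level
dat = data.frame(
  cohort = sample(c(2004, 2006, 2007), n, replace = TRUE),
  period = sample(2003:2007, n, replace = TRUE)
)
dat$y = 0.5 * (dat$period >= dat$cohort) + rnorm(n)

# Full-data fit on all n rows
fit_full = lm(y ~ factor(cohort) + factor(period), data = dat)

# Compress to one row per (cohort, period) cell: mean outcome plus cell size N
cell_mean = aggregate(y ~ cohort + period, data = dat, FUN = mean)
cell_n    = aggregate(y ~ cohort + period, data = dat, FUN = length)
names(cell_n)[names(cell_n) == "y"] = "N"
comp = merge(cell_mean, cell_n, by = c("cohort", "period"))

# Weighted fit on the compressed data (at most 15 rows instead of 5000)
fit_comp = lm(y ~ factor(cohort) + factor(period), data = comp, weights = N)

# Point estimates should agree up to floating-point tolerance
all.equal(coef(fit_full), coef(fit_comp))
```

Two caveats apply. The standard errors from this naive compressed fit will not match the full-data ones, since the compressed regression sees far fewer observations. And `emfx()` itself averages all columns within each cell before computing marginal effects, so the small loss of accuracy described above can still occur there even though this linear toy case is exact.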

It's a bit of performance art, since all of the examples in this vignette
complete very quickly anyway. But here is a reworking of our earlier event study
@@ -496,18 +498,18 @@ example to demonstrate this performance-conscious workflow.
emfx(mod, type = "event", vcov = FALSE)
# Step 2
emfx(mod, type = "event", vcov = FALSE, collapse = TRUE)
emfx(mod, type = "event", vcov = FALSE, compress = TRUE)
# Step 3: Results from 1 and 2 are similar enough, so get approx. SEs
mod_es2 = emfx(mod, type = "event", collapse = TRUE)
mod_es_compressed = emfx(mod, type = "event", compress = TRUE)
```

To put a fine point on it, we can compare our original event study with the
collapsed estimates and see that the results are indeed very similar.
compressed estimates and see that the results are indeed very similar.

```{r}
modelsummary(
list("Original" = mod_es, "Collapsed" = mod_es2),
list("Original" = mod_es, "Compressed" = mod_es_compressed),
shape = term:event:statistic ~ model,
coef_rename = rename_fn,
gof_omit = "Adj|Within|IC|RMSE",
