wonky plot from `check_model()` on a glmmTMB example (#654)
A quick guess is that inappropriate residuals are calculated. This is the code to detect overdispersion for this specific model:

```r
d <- data.frame(Predicted = stats::predict(model, type = "response"))
d$Residuals <- insight::get_response(model) - as.vector(d$Predicted)
d$Res2 <- d$Residuals^2
d$V <- d$Predicted * (1 + d$Predicted / insight::get_sigma(model))
d$StdRes <- insight::get_residuals(model, type = "pearson")
```

And the Q-Q plot for glmmTMB simply uses … If you don't have a suggestion for calculating the most appropriate residuals, the best solution is probably to go with simulated residuals and diagnostics based on DHARMa (#643) |
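To illustrate the relationship the snippet above relies on (this is my own base-R sketch, not the package's code): for a plain Poisson GLM, the Pearson residual is just the raw residual divided by the square root of the variance function evaluated at the fitted mean. The whole issue is that for `nbinom1`/`nbinom2` the variance function `V(mu)` is no longer simply `mu`:

```r
# Sketch (not performance's actual code): for a Poisson GLM, Pearson
# residuals are raw residuals scaled by sqrt of the variance function,
# V(mu) = mu. For nbinom1/nbinom2 the variance function differs, which
# is the crux of this issue.
set.seed(1)
dd <- data.frame(x = rnorm(100))
dd$y <- rpois(100, lambda = exp(0.5 + 0.3 * dd$x))
fit <- glm(y ~ x, family = poisson, data = dd)

mu <- predict(fit, type = "response")
manual_pearson <- (dd$y - mu) / sqrt(family(fit)$variance(mu))

# agrees with the built-in Pearson residuals
stopifnot(all.equal(unname(manual_pearson),
                    unname(residuals(fit, type = "pearson"))))
```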
OK, the Q-Q plot should definitely be using … I don't know enough about how the columns in the overdispersion machinery are being used downstream, or why you need to compute the variance yourself; it seems like it should be possible to get it more reliably with built-in … |
@bwiernik I think you wrote most/all of the code for the overdispersion plot/check. |
Base R switched to using a half-normal for binomial and count models. I would suggest we keep using it for … |
I think I just wrote one set of code there and didn't check if anything was already available for glmmTMB models. It would be good to use existing machinery there if possible. |
That would be poisson and binomial, but not nbinom? |
Ok - but I'm not sure how to revise the relevant code section. Happy if someone could make a proposal. |
Base R uses a half-normal with absolute standardized deviance residuals for any family of model fit with … If we can't get standardized residuals, then we probably need to adjust the reference distribution so that it is not a standard normal/half-normal. |
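As a rough illustration of the half-normal approach described above (my own base-R sketch, not the `plot.lm` source), plot the sorted absolute standardized deviance residuals against half-normal quantiles:

```r
# Half-normal plot with absolute standardized deviance residuals,
# roughly as base R does for binomial/count GLMs (sketch only).
set.seed(42)
dd <- data.frame(x = rnorm(200))
dd$y <- rpois(200, exp(1 + 0.4 * dd$x))
fit <- glm(y ~ x, family = poisson, data = dd)

r <- sort(abs(rstandard(fit, type = "deviance")))
n <- length(r)
# half-normal theoretical quantiles (formula as in e.g. faraway::halfnorm)
q <- qnorm((n + seq_len(n)) / (2 * n + 1))
plot(q, r,
  xlab = "Half-normal quantiles",
  ylab = "|standardized deviance residuals|"
)
abline(0, 1, lty = 2)
```

If the model is well specified, the points should roughly follow the dashed identity line.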
Let me dig into these plots a bit and see what the most justifiable default should be. |
@bbolker we compute the per-observation variance - not sure how to do this with … |
Hmm. In principle it would be nice if it were … |
Theory on half-normal plot for deviance residuals https://www.jstatsoft.org/article/view/v081i10 |
This looks better for glmmTMB now, but the overdispersion plot for glm.nb looks strange to me. Maybe it's correct, though? See #680

```r
library(glmmTMB)
library(performance)
library(MASS)
library(dplyr) ## for mutate_at, %>%
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:MASS':
#>
#>     select
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

x <- c("A", "B", "C", "D")
time <- rep(x, each = 20, times = 3) # time factor
y <- c("exposed", "ref1", "ref2")
lake <- rep(y, each = 80) # lake factor
set.seed(123)
min <- runif(n = 240, min = 4.5, max = 5.5) # mins used in model offset
set.seed(123)
count <- rnbinom(n = 240, mu = 10, size = 100) # randomly generated negative binomial data

# make data frame
dat <- as.data.frame(cbind(time, lake, min, count))
dat <- dat %>%
  mutate_at(c("min", "count"), as.numeric)

# remove one combination of factors to make example rank deficient
# (all observations from time A and lake ref1)
dat2 <- filter(dat, time != "A" | lake != "ref1")

model <- glmmTMB(count ~ time * lake,
  family = nbinom1,
  control = glmmTMBControl(rank_check = "adjust"),
  offset = log(min), data = dat2
)
#> dropping columns from rank-deficient conditional model: timeD:lakeref1
check_model(model)
#> `check_outliers()` does not yet support models of class `glmmTMB`.

m2 <- glm.nb(count ~ time * lake + offset(log(min)), data = dat2)
check_model(m2)
```

Created on 2024-02-05 with reprex v2.1.0 |
We still need to be careful. |
We can add a different handling for nbinom2, but what would be the solution in this case? 1/sigma? |
If the new code is correct, this would be the result. First, the new code:

```r
if (faminfo$is_negbin && !faminfo$is_zero_inflated) {
  if (inherits(model, "glmmTMB")) {
    d <- data.frame(Predicted = stats::predict(model, type = "response"))
    d$Residuals <- insight::get_residuals(model, type = "pearson")
    d$Res2 <- d$Residuals^2
    d$StdRes <- insight::get_residuals(model, type = "pearson")
    if (faminfo$family == "nbinom1") {
      # for nbinom1, we can use "sigma()"
      d$V <- insight::get_sigma(model)^2 * stats::family(model)$variance(d$Predicted)
    } else {
      # for nbinom2, "sigma()" has "inverse meaning" (see #654)
      d$V <- (1 / insight::get_sigma(model)^2) * stats::family(model)$variance(d$Predicted)
    }
  } else {
    ## FIXME: this is not correct for glm.nb models?
    d <- data.frame(Predicted = stats::predict(model, type = "response"))
    d$Residuals <- insight::get_response(model) - as.vector(d$Predicted)
    d$Res2 <- d$Residuals^2
    d$V <- d$Predicted * (1 + d$Predicted / insight::get_sigma(model))
    d$StdRes <- insight::get_residuals(model, type = "pearson")
  }
}
```

Result:

```r
model <- glmmTMB(count ~ time * lake,
  family = nbinom1,
  control = glmmTMBControl(rank_check = "adjust"),
  offset = log(min), data = dat2
)
m3 <- update(model, family = nbinom2)
p1 <- plot(check_overdispersion(model))
p2 <- plot(check_overdispersion(m3))
p1 + p2
```
|
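The distinction the branching above encodes is between the two negative binomial parameterisations: `nbinom1` has variance linear in the mean, `V(mu) = mu * (1 + alpha)`, while `nbinom2` has variance quadratic in the mean, `V(mu) = mu + mu^2 / theta`. A tiny numeric sketch (the `alpha` and `theta` values are arbitrary, for illustration only):

```r
# Variance functions of the two negative-binomial parameterisations
# (alpha and theta are arbitrary illustration values, not fitted ones).
mu <- c(1, 5, 10, 20)
alpha <- 0.5 # nbinom1 dispersion parameter
theta <- 2 # nbinom2 size/shape parameter
v_nb1 <- mu * (1 + alpha) # linear in the mean
v_nb2 <- mu + mu^2 / theta # quadratic in the mean

# both exceed the Poisson variance (mu), i.e. both are overdispersed
stopifnot(all(v_nb1 > mu), all(v_nb2 > mu))
```

This is why a single `sigma()`-based formula cannot serve both families: the dispersion parameter enters the variance in structurally different ways.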
Looks good (although obviously … |
While we're talking about variance, documentation etc.: this is the variance function from beta families:

```r
glmmTMB::beta_family()$variance
#> function (mu)
#> {
#>     mu * (1 - mu)
#> }
#> <bytecode: 0x000001dc83b3ae58>
#> <environment: 0x000001dc83b3aa30>
```

Could you clarify which one is correct? I used the one from the docs in … |
A "variance" option in … |
That's the thing:

```r
sigma(model)^2 * family(model)$variance(predict(model, type = "response"))
```

works generally ... It could be worth making breaking changes to … |
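That generic recipe can be sanity-checked in base R for a family where the answer is known in closed form. For a Gaussian GLM the variance function is identically 1, so `sigma(model)^2 * family(model)$variance(mu)` should simply reproduce the residual variance estimate for every observation (a sketch under that assumption, not performance's code):

```r
# Check the generic recipe sigma^2 * family()$variance(mu) on a Gaussian GLM,
# where family(fit)$variance(mu) is identically 1, so the product is sigma^2.
set.seed(7)
dd <- data.frame(x = rnorm(50))
dd$y <- 2 + 3 * dd$x + rnorm(50, sd = 1.5)
fit <- glm(y ~ x, family = gaussian, data = dd)

mu <- predict(fit, type = "response")
v <- sigma(fit)^2 * family(fit)$variance(mu)

# one (constant) per-observation variance, equal to sigma^2
stopifnot(all.equal(unname(v), rep(sigma(fit)^2, length(mu))))
```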
I think this is where my lack of statistical expertise prevents me from understanding how to proceed ;-) The initial code base in …

This is, e.g., what is done for the beta family, based on the docs (see my screenshot above):

```r
# Get distributional variance for beta-family
# ----------------------------------------------
.variance_family_beta <- function(x, mu, phi) {
  if (inherits(x, "MixMod")) {
    stats::family(x)$variance(mu)
  } else {
    mu * (1 - mu) / (1 + phi)
  }
}
```

This sometimes leads to weird results when computing R2 or ICC for mixed models. For most families, … Is there some way to "validate" the results against something? E.g. against non-mixed or Bayesian models, or some simulated data where the R2 is known? (not sure how to simulate such data, though) |
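On the validation question: one option (my own sketch, not an established performance workflow) is to simulate data where the population R2 is fixed by construction and check that the fitted estimate recovers it. In the Gaussian case, with `y = b*x + e`, `x ~ N(0, 1)` and `e ~ N(0, s2)`, the population R2 is `b^2 / (b^2 + s2)`:

```r
# Simulate data with a known population R2 and check the fitted R2 recovers it.
# With y = b*x + e, x ~ N(0,1), e ~ N(0, s2): population R2 = b^2 / (b^2 + s2).
set.seed(99)
n <- 1e5
b <- 1
s2 <- 1 # chosen so that the true R2 = 1 / (1 + 1) = 0.5
x <- rnorm(n)
y <- b * x + rnorm(n, sd = sqrt(s2))

fit <- lm(y ~ x)
r2_hat <- summary(fit)$r.squared
stopifnot(abs(r2_hat - 0.5) < 0.02)
```

The same construction generalizes (with more algebra) to GLMMs by simulating known fixed and random effect variances, which is roughly the strategy used in the literature on marginal/conditional R2 for mixed models.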
The initial issue should be fixed, but I'm re-opening for the points discussed later in this thread.
Q-Q plot now based on DHARMa (see #643), but we still need to think about overdispersion plots.

```r
library(glmmTMB)
library(performance)
library(datawizard)

# Build example data
x <- c("A", "B", "C", "D")
time <- rep(x, each = 20, times = 3) # time factor
y <- c("exposed", "ref1", "ref2")
lake <- rep(y, each = 80) # lake factor
set.seed(123)
min <- runif(n = 240, min = 4.5, max = 5.5) # mins used in model offset
set.seed(123)
count <- rnbinom(n = 240, mu = 10, size = 100) # randomly generated negative binomial data

# make data frame
dat <- as.data.frame(cbind(time, lake, min, count))
dat <- dat |>
  data_modify(.at = c("min", "count"), .modify = as.numeric)

# remove one combination of factors to make example rank deficient
# (all observations from time A and lake ref1)
dat2 <- data_filter(dat, time != "A" | lake != "ref1")

model <- glmmTMB(count ~ time * lake,
  family = nbinom1,
  control = glmmTMBControl(rank_check = "adjust"),
  offset = log(min), data = dat2
)
#> dropping columns from rank-deficient conditional model: timeD:lakeref1
check_model(model)
#> `check_outliers()` does not yet support models of class `glmmTMB`.
```

Created on 2024-03-16 with reprex v2.1.0 |
@bwiernik Based on my comments here: #643 (comment) - the question is whether we need to do anything regarding the code of the overdispersion plot? The current code relies on residuals / Pearson residuals. To check for over-/underdispersion in more complex models, we now use simulated residuals based on the DHARMa package. Can these residuals possibly be used for the code that creates overdispersion plots? A draft to play with is performance/R/check_model_diagnostics.R line 296 in 35b5e19, but I'm not sure if this code really works well. I'm not fully understanding the implementation in performance/R/check_model_diagnostics.R line 370 in 35b5e19, and how this "translates" into a function using simulated residuals? |
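For intuition about how a simulation-based dispersion check works in principle (a hand-rolled base-R sketch; the actual DHARMa and performance implementations are more careful): simulate new responses from the fitted model, compute a dispersion statistic on each simulated dataset, and compare the observed statistic against that simulated reference distribution.

```r
# Hand-rolled simulation-based dispersion check (sketch; DHARMa does this
# more carefully). Compare the observed sum of squared Pearson residuals
# against the same statistic computed on datasets simulated from the model.
set.seed(3)
dd <- data.frame(x = rnorm(200))
dd$y <- rpois(200, exp(1 + 0.5 * dd$x)) # truly Poisson: no overdispersion
fit <- glm(y ~ x, family = poisson, data = dd)

disp_stat <- function(y, mu) sum((y - mu)^2 / mu)
mu <- fitted(fit)
obs <- disp_stat(dd$y, mu)

sims <- simulate(fit, nsim = 200)
sim_stats <- vapply(sims, disp_stat, numeric(1), mu = mu)

# two-sided simulation p-value for over-/underdispersion
p <- min(1, 2 * min(mean(sim_stats >= obs), mean(sim_stats <= obs)))
```

Because the reference distribution comes from the model itself, no family-specific variance function or `sigma()` interpretation is needed, which is exactly what makes this attractive for nbinom1/nbinom2 and zero-inflated families.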
Or is the major concern still the variance function and/or sigma? |
I think we should be able to use the dharma residuals. Let me take a look |
This is from an `nbinom1` model - the "overdispersion" and "normality of residuals" plots both look odd ...