
```{r}
#| label: setup
#| include: false

source(here::here("R/quarto-setup.R"))
```

<!-- badges: start -->
[![Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.](https://www.repostatus.org/badges/latest/inactive.svg)](https://www.repostatus.org/#inactive)
[![License: MIT](https://img.shields.io/badge/license-MIT-green)](https://choosealicense.com/licenses/mit/)
<!-- badges: end -->

## Overview

This document demonstrates the application of [general linear models](https://en.wikipedia.org/wiki/General_linear_model), with a focus on multiple regression. It uses the [`penguins`](https://allisonhorst.github.io/palmerpenguins/reference/penguins.html) dataset from the [`palmerpenguins`](https://github.com/allisonhorst/palmerpenguins/) R package, which contains measurements of penguin species from the [Palmer Archipelago](https://en.wikipedia.org/wiki/Palmer_Archipelago). The dataset was originally introduced by @gorman2014.

::: {#fig-penguins-1}
![](images/palmer_penguins.png){fig-align="center" width="75%"}

Artwork by [Allison Horst](https://allisonhorst.com/).
:::

## Question

Every scientific investigation begins with a question. In this case, we will address the following:

**Can bill length and bill depth alone effectively predict flipper length in [Adélie penguins](https://en.wikipedia.org/wiki/Ad%C3%A9lie_penguin)?**

Imagine a debate between two marine biologists: one claims that bill length and depth could be used to predict flipper length, while the other disagrees.

To investigate this question, we will use a dataset from the [`palmerpenguins`](https://github.com/allisonhorst/palmerpenguins/) R package. The relevant variables are `bill_length_mm`, `bill_depth_mm`, and `flipper_length_mm` for Adélie penguins. These variables are [defined](https://allisonhorst.github.io/palmerpenguins/reference/penguins.html) as follows:

- `bill_length_mm`: Numerical value representing the bill's length in millimeters.
- `bill_depth_mm`: Numerical value representing the bill's depth in millimeters.
- `flipper_length_mm`: Integer value representing the flipper's length in millimeters.

::: {#fig-penguins-2}
![](images/culmen_depth.png){fig-align="center" width="75%"}

Artwork by [Allison Horst](https://allisonhorst.com/).
:::

## Hypothesis

To approach our question, we will apply Popper’s hypothetico-deductive method, also known as the *method of conjecture and refutation* [@popper1979, p. 164]. The basic structure of this approach can be summarized as follows:

```{mermaid}
%%| label: fig-mermaid
%%| fig-cap: Simplified schema of Popper’s hypothetico-deductive method.
%%| fig-align: center

flowchart LR
  A(P1) --> B(TT)
  B --> C(EE)
  C --> D(P2)
```

"Here $\text{P}_1$, is the **problem** from which we start, $\text{TT}$ (the ‘tentative theory’) is the imaginative conjectural solution which we first reach, for example our first **tentative interpretation**. $\text{EE}$ (‘**error- elimination**’) consists of a severe critical examination of our conjecture, our tentative interpretation: it consists, for example, of the critical use of documentary evidence and, if we have at this early stage more than one conjecture at our disposal, it will also consist of a critical discussion and comparative evaluation of the competing conjectures. $\text{P}_2$ is the problem situation as it emerges from our first critical attempt to solve our problems.
It leads up to our second attempt (**and so on**)." [@popper1979, p. 164]

As our tentative theory or main hypothesis, I propose the following:

**Bill length and bill depth can effectively predict flipper length in Adélie penguins**.

As our procedure, we will employ a method **inspired** by the Neyman-Pearson approach to data testing [@neyman1928; @neyman1928a; @perezgonzalez2015], evaluating the following hypotheses:

$$
\begin{cases}
\text{H}_{0}: \text{Bill length and bill depth cannot effectively predict flipper length in Adélie penguins} \\
\text{H}_{a}: \text{Bill length and bill depth can effectively predict flipper length in Adélie penguins}
\end{cases}
$$

::: {.callout-warning}
 Technically, our procedural method is not a strictly Neyman-Pearson acceptance test; we might refer to it as an improved [NHST](https://en.wikipedia.org/wiki/Statistical_hypothesis_test) (Null Hypothesis Significance Testing) approach, based on the original Neyman-Pearson ideas.
:::

## Methods

To test our hypothesis, we will use a general linear model with multiple regression analysis, evaluating the relationship between multiple predictors and a response variable. Here, the response variable is `flipper_length_mm`, while the predictors are `bill_length_mm` and `bill_depth_mm`.

To define what we mean by *effectively predict*, we will establish the following decision criteria:

- Predictors should exhibit a statistically significant association with the response variable.
- The model should satisfy all validity assumptions.
- The variance explained by the predictors ($\text{R}^{2}_{\text{adj}}$) must exceed 0.5, suggesting a strong association with the response variable ($\text{Cohen's } f^2 = \cfrac{0.5}{1 - 0.5} = 1$).

This 0.5 threshold is not arbitrary; it represents the average level of variance explained in response variables during observational field studies in ecology, especially when there is limited control over factors influencing variance [@peek2003].
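The conversion from explained variance to Cohen's $f^2$ used in the last criterion can be checked with a few lines of R. This is a minimal sketch: `cohens_f2()` is a hypothetical helper name introduced only for illustration, not a function from any package used in this document.

```r
# Hypothetical helper: convert a proportion of explained variance (R^2)
# into Cohen's f^2 = R^2 / (1 - R^2).
cohens_f2 <- function(r2) {
  stopifnot(is.numeric(r2), all(r2 >= 0 & r2 < 1))
  r2 / (1 - r2)
}

# The 0.5 threshold adopted above corresponds to f^2 = 1.
cohens_f2(0.5)
```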

Finally, our hypothesis test can be systematized as follows:

$$
\begin{cases}
\text{H}_{0}: \text{R}^{2}_{\text{adj}} \leq 0.5 \\
\text{H}_{a}: \text{R}^{2}_{\text{adj}} > 0.5
\end{cases}
$$

In addition to an adjusted R-squared greater than 0.5, we will require predictors to show statistically significant associations and for the model to meet all assumptions.
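The combined decision rule can be sketched as a small function. This is only an illustration under stated assumptions: `evaluate_criteria()` and its arguments are hypothetical names, the data are simulated (not the penguins data), and the `assumptions_ok` flag has to be supplied by the analyst after the diagnostic checks described in this section.

```r
# Hypothetical helper: evaluate the decision criteria on a fitted `lm` object.
# `assumptions_ok` must be supplied by the analyst after the diagnostic checks.
evaluate_criteria <- function(model, assumptions_ok, alpha = 0.05, r2_min = 0.5) {
  stopifnot(inherits(model, "lm"), is.logical(assumptions_ok))

  model_summary <- summary(model)

  # p-values of the slope coefficients (the intercept row is dropped).
  p_values <- model_summary$coefficients[-1, "Pr(>|t|)"]

  criteria <- c(
    predictors_significant = all(p_values < alpha),
    assumptions_ok = isTRUE(assumptions_ok),
    adj_r2_above_threshold = model_summary$adj.r.squared > r2_min
  )

  # H0 is rejected only when every criterion is met.
  list(criteria = criteria, reject_h0 = all(criteria))
}

# Simulated data only, to show the call.
set.seed(1)
sim <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
sim$y <- 1 + 2 * sim$x1 - 1 * sim$x2 + rnorm(100)

evaluate_criteria(lm(y ~ x1 + x2, data = sim), assumptions_ok = TRUE)
```

Keeping the three criteria as separate named flags makes it easy to see which one fails when the null hypothesis is not rejected.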

We will set the significance level ($\alpha$) at 0.05, allowing a 5% chance of a [Type I error](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors). A power analysis will be performed to determine the necessary sample size for detecting a significant effect, targeting a power ($1 - \beta$) of 0.8.

Assumption checks will include:

- Assessing the normality of residuals through visual inspections, such as [Q-Q plots](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot), and statistical tests like the [Shapiro-Wilk test](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test).
- Evaluating homoscedasticity using tests like the [Breusch-Pagan test](https://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test) to ensure constant variance of residuals across predictor levels.

We will assess multicollinearity by calculating variance inflation factors ([VIF](https://en.wikipedia.org/wiki/Variance_inflation_factor)), with a VIF above 10 indicating potential issues. Influential points will be examined using [Cook's distance](https://en.wikipedia.org/wiki/Cook%27s_distance) and leverage values to identify any points that may disproportionately affect model outcomes.
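The checks listed above can be sketched with base R alone. The example below uses simulated data (not the penguins data) and hypothetical object names. The Breusch-Pagan statistic is computed by hand in its studentized form, and the VIFs come from their definition, so the sketch does not depend on any add-on package.

```r
# Simulated data only (not the penguins data).
set.seed(42)
n <- 120
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 3 + 1.5 * sim$x1 + 0.8 * sim$x2 + rnorm(n)

sim_fit <- lm(y ~ x1 + x2, data = sim)

# Normality of residuals (Shapiro-Wilk).
shapiro_p <- shapiro.test(residuals(sim_fit))$p.value

# Homoscedasticity: studentized Breusch-Pagan statistic, computed as
# n * R^2 of the auxiliary regression of squared residuals on the predictors.
aux_fit <- lm(residuals(sim_fit)^2 ~ x1 + x2, data = sim)
bp_stat <- n * summary(aux_fit)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

# Multicollinearity: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on the remaining predictors.
vif <- c(
  x1 = 1 / (1 - summary(lm(x1 ~ x2, data = sim))$r.squared),
  x2 = 1 / (1 - summary(lm(x2 ~ x1, data = sim))$r.squared)
)

# Influence: Cook's distance and leverage (hat values).
cooks_d <- cooks.distance(sim_fit)
leverage <- hatvalues(sim_fit)

list(shapiro_p = shapiro_p, bp_p = bp_p, vif = vif, max_cooks_d = max(cooks_d))
```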

::: {.callout-note}
It's important to emphasize that we are assessing predictive power, not establishing causality. Predictive models alone should never be used to infer causal relationships [@arif2022].
:::

## An overview of general linear models

Before proceeding, let's briefly review general linear models, with a focus on multiple regression analysis.

"[...] A problem of this type is called a problem of multiple linear regression because we are considering the regression of $Y$ on $k$ variables $X_{1}, \dots, X_{k}$, rather than on just a single variable $X$, and we are assuming also that this regression is a linear function of the parameters $\beta_{0}, \dots, \beta_{k}$. In a problem of multiple linear regression, we obtain $n$ vectors of observations ($x_{i1}, \dots, x_{ik}, Y_{i}$), for $i = 1, \dots, n$. Here $x_{ij}$ is the observed value of the variable $X_{j}$ for the $i$th observation. The $E(Y)$ is given by the relation

$$
E(Y_{i}) = \beta_{0} + \beta_{1} x_{i1} + \dots + \beta_{k} x_{ik}
$$

[@degroot2012, p. 738]
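To make the relation above concrete, the following sketch estimates the parameters by solving the normal equations and compares the result with `lm()`. The data are simulated and the "true" coefficients are arbitrary. Solving the normal equations directly is shown only for illustration (`lm()` itself relies on a QR decomposition).

```r
# Simulated data only; the "true" coefficients below are arbitrary.
set.seed(123)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.5)

# Design matrix with a leading column of ones for the intercept (beta_0).
X <- cbind(intercept = 1, x1 = x1, x2 = x2)

# Least-squares estimates from the normal equations: (X'X) beta = X'y.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Estimated E(Y_i) = beta_0 + beta_1 * x_i1 + beta_2 * x_i2 for every i.
expected_y <- drop(X %*% beta_hat)

# `lm()` should return the same estimates (names are ignored in the comparison).
all.equal(drop(beta_hat), coef(lm(y ~ x1 + x2)), check.attributes = FALSE)
```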

### Definitions

Residuals/Fitted values
: \hspace{20cm} For $i = 1, \dots, n$, the observed values of $\hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{i}$ are called _fitted values_. For $i = 1, \dots, n$, the observed values of $e_{i} = y_{i} - \hat{y}_{i}$ are called _residuals_ [@degroot2012, p. 717].

"[...] regression problems in which the observations $Y_{1}, \dots, Y_{n}$ [...] we shall assume that each observation $Y_{i}$ has a normal distribution, that the observations $Y_{1}, \dots, Y_{n}$ are independent, and that the observations $Y_{1}, \dots, Y_{n}$ have the same variance $\sigma^{2}$. Instead of a single predictor being associated with each $Y_{i}$, we assume that a $p$-dimensional vector $z_{i} = (z_{i0}, \dots, z_{ip - 1})$ is associated with each $Y_{i}$" [@degroot2012, p. 736].

General linear model
: The statistical model in which the observations $Y_{1}, \dots, Y_{n}$ satisfy the following assumptions [@degroot2012, p. 738].

### Assumptions

Assumption 1
: \hspace{20cm} __Predictor is known__. Either the vectors $z_{1}, \dots , z_{n}$ are known ahead of time, or they are the observed values of random vectors $Z_{1}, \dots , Z_{n}$ on whose values we condition before computing the joint distribution of ($Y_{1}, \dots , Y_{n}$) [@degroot2012, p. 736].

Assumption 2
: \hspace{20cm} __Normality__. For $i = 1, \dots, n$, the conditional distribution of $Y_{i}$ given the vectors $z_{1}, \dots , z_{n}$ is a normal distribution [@degroot2012, p. 737].

(Normality of the error term distribution [@hair2019, p. 287])

Assumption 3
: \hspace{20cm} __Linear mean__. There is a vector of parameters  $\beta = (\beta_{0}, \dots, \beta_{p - 1})$ such that the conditional mean of $Y_{i}$ given the values $z_{1}, \dots , z_{n}$ has the form

$$
z_{i0} \beta_{0} + z_{i1} \beta_{1} + \cdots + z_{ip - 1} \beta_{p - 1}
$$

for $i = 1, \dots, n$ [@degroot2012, p. 737].

(Linearity of the phenomenon measured [@hair2019, p. 287])

::: {.callout-warning}
It is important to clarify that the linear assumption pertains to **linearity in the parameters** or equivalently, linearity in the coefficients. This means that each predictor is multiplied by its corresponding regression coefficient. However, this does not imply that the relationship between the predictors and the response variable is linear. In fact, a linear model can still effectively capture non-linear relationships between predictors and the response variable by utilizing transformations of the predictors [@cohen2002].
:::
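The point made in the callout can be illustrated with simulated data. In the sketch below, the relationship between `x` and `y` is deliberately quadratic. The second model remains linear in its parameters, yet it captures the curvature through the transformed predictor `I(x^2)`. All object names are hypothetical.

```r
# Simulated data with a curved (quadratic) relationship.
set.seed(2024)
x <- seq(-3, 3, length.out = 200)
y <- 1 + 0.5 * x + 2 * x^2 + rnorm(length(x), sd = 0.5)

# Straight-line model: linear in the parameters and in x.
line_fit <- lm(y ~ x)

# Quadratic model: still linear in the parameters (beta_0, beta_1, beta_2),
# but non-linear in x thanks to the transformed predictor I(x^2).
curve_fit <- lm(y ~ x + I(x^2))

# Compare the adjusted R-squared of both models.
c(
  line = summary(line_fit)$adj.r.squared,
  quadratic = summary(curve_fit)$adj.r.squared
)
```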

Assumption 4
: \hspace{20cm} __Common variance__ (homoscedasticity). There is a parameter $\sigma^{2}$ such that the conditional variance of $Y_{i}$ given the values $z_{1}, \dots , z_{n}$ is $\sigma^{2}$ for $i = 1, \dots, n$.

(Constant variance of the error terms [@hair2019, p. 287])

Assumption 5
: \hspace{20cm} __Independence__. The random variables $Y_{1}, \dots , Y_{n}$ are independent given the observed $z_{1}, \dots , z_{n}$ [@degroot2012, p. 737].

(Independence of the error terms [@hair2019, p. 287])

## Setting up the environment

```{r}
#| eval: false
#| code-fold: true

library(broom)
library(car)
library(checkmate)
library(cowplot)
library(dplyr)
library(effectsize)
library(fBasics)
library(forecast)
library(ggeffects)
library(GGally)
library(ggplot2)
library(ggpmisc)
library(ggplotify)
library(qqplotr)
library(ggPredict)
library(glue)
library(insight)
library(janitor)
library(latex2exp)
library(magrittr)
library(moments)
library(nortest)
library(olsrr)
library(palmerpenguins)
library(parameters)
library(parsnip)
library(performance)
library(predict3d)
library(psychometric)
library(pwrss)
library(recipes)
library(report)
library(rgl)
library(rutils)
library(sandwich)
library(stats)
library(stringr)
library(tidyr)
library(tseries)
library(viridis)
library(workflows)
```

```{r}
#| include: false

library(magrittr)
```

```{r}
#| code-fold: true

gg_color_hue <- function(n) {
  hues <- seq(15, 375, length.out = n + 1)
  grDevices::hcl(h = hues, l = 65, c = 100)[1:n]
}
```

```{r}
#| code-fold: true

lm_fun <- function(model, fix_all_but = NULL, data = NULL) {
  checkmate::assert_class(model, "lm")
  
  checkmate::assert_number(
    fix_all_but, 
    lower = 1,
    upper = length(stats::coef(model)) - 1, 
    null.ok = TRUE
  )
  
  # Use the `model` argument, not a `fit` object from the global environment.
  coef <- broom::tidy(model)
  vars <- letters[seq_len((nrow(coef) - 1))]
  
  fixed_vars <- vars
  
  if (!is.null(fix_all_but)) {
    checkmate::assert_data_frame(data)
    checkmate::assert_subset(coef$term[-1], names(data))
    
    for (i in seq_along(fixed_vars)[-fix_all_but]) {
      fixed_vars[i] <- mean(data[[coef$term[i + 1]]], na.rm = TRUE)
    }
    
    vars <- vars[fix_all_but]
  }
  
  fun_exp <- str2expression(
      glue::glue(
        "function({paste0(vars, collapse = ', ')}) {{", "\n",
        "  {paste0('checkmate::assert_numeric(', vars, ')', collapse = '\n')}",
        "\n\n",
        "  {coef$estimate[1]} +",
        "{paste0(coef$estimate[-1], ' * ', fixed_vars, collapse = ' + ')}",
        "\n",
        "}}"
      )
    )
  
  out <- eval(fun_exp)
  
  out
}
```

```{r}
#| code-fold: true

lm_str_fun <- function(
    model, 
    digits = 3,
    latex2exp = TRUE,
    fix_all_but = NULL, # Ignore the intercept coefficient.
    fix_fun = "Mean",
    coef_names = NULL # Ignore the intercept coefficient.
  ) {
  checkmate::assert_class(model, "lm")
  checkmate::assert_number(digits)
  checkmate::assert_flag(latex2exp)
  
  checkmate::assert_number(
    fix_all_but, 
    lower = 1,
    upper = length(stats::coef(model)) - 1, 
    null.ok = TRUE
  )
  
  checkmate::assert_string(fix_fun)
  
  checkmate::assert_character(
    coef_names, 
    any.missing = FALSE, 
    len = length(names(stats::coef(model))) - 1,
    null.ok = TRUE
  )
  
  if (is.null(coef_names)) coef_names <- names(stats::coef(model))[-1]
  
  coef <- list()
  
  for (i in seq_along(coef_names)) {
    coef[[coef_names[i]]] <- 
      stats::coef(model) |> 
      magrittr::extract(i + 1) |>
      rutils:::clear_names() |>
      round(digits)
  }
  
  coef_names <-
    coef_names |>
    stringr::str_replace_all("\\_|\\.", " ") |>
    stringr::str_to_title() |>
    stringr::str_replace(" ", "")
  
  if (!is.null(fix_all_but)) {
    for (i in seq_along(coef_names)[-fix_all_but]) {
      coef_names[i] <- paste0(fix_fun, "(", coef_names[i], ")")
    }
  }
  
  out <- paste0(
    "$", "y =", " ", 
    round(stats::coef(model)[1], digits), " + ",
    paste0(coef, " \\times ", coef_names, collapse = " + "),
    "$"
  )
  
  out <- out |> stringr::str_replace("\\+ \\-", "\\- ") 
  
  if (isTRUE(latex2exp)) {
    out |> latex2exp::TeX()
  } else {
    out
  }
}
```

```{r}
#| code-fold: true

test_outlier <- function(
    x, 
    method = "iqr", 
    iqr_mult = 1.5, 
    sd_mult = 3
  ) {
  checkmate::assert_numeric(x)
  checkmate::assert_choice(method, c("iqr", "sd"))
  checkmate::assert_number(iqr_mult)
  checkmate::assert_number(sd_mult)

  if (method == "iqr") {
    iqr <- stats::IQR(x, na.rm = TRUE)
    min <- stats::quantile(x, 0.25, na.rm = TRUE) - (iqr_mult * iqr)
    max <- stats::quantile(x, 0.75, na.rm = TRUE) + (iqr_mult * iqr)
  } else if (method == "sd") {
    min <- mean(x, na.rm = TRUE) - (sd_mult * stats::sd(x, na.rm = TRUE))
    max <- mean(x, na.rm = TRUE) + (sd_mult * stats::sd(x, na.rm = TRUE))
  }

  dplyr::if_else(x >= min & x <= max, FALSE, TRUE, missing = FALSE)
}
```
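The `"iqr"` branch of `test_outlier()` applies what is usually called Tukey's fences: values below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$ are flagged. A worked example on a small, made-up vector, using base R only, is sketched below.

```r
# Made-up vector with one obviously extreme value.
x <- c(1, 2, 3, 4, 5, 100)

q1 <- unname(quantile(x, 0.25)) # 2.25
q3 <- unname(quantile(x, 0.75)) # 4.75
iqr <- q3 - q1                  # 2.5

lower <- q1 - 1.5 * iqr         # -1.5
upper <- q3 + 1.5 * iqr         # 8.5

# TRUE marks a value outside the fences; only the last value (100) is flagged.
x < lower | x > upper
```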

```{r}
#| code-fold: true

remove_outliers <- function(
    x, 
    method = "iqr", 
    iqr_mult = 1.5, 
    sd_mult = 3
  ) {
  checkmate::assert_numeric(x)
  checkmate::assert_choice(method, c("iqr", "sd"))
  checkmate::assert_number(iqr_mult, lower = 1)
  checkmate::assert_number(sd_mult, lower = 0)

  x |>
    test_outlier(
      method = method, 
      iqr_mult = iqr_mult, 
      sd_mult = sd_mult
    ) %>%
    `!`() %>%
    magrittr::extract(x, .)
}
```

```{r}
#| code-fold: true

list_as_tibble <- function(list) {
  checkmate::assert_list(list)

  list |>
    dplyr::as_tibble() |>
    dplyr::mutate(
      dplyr::across(
        .cols = dplyr::everything(),
        .fns = as.character
      )
    ) |>
    tidyr::pivot_longer(cols = dplyr::everything())
}
```

```{r}
#| code-fold: true

stats_sum <- function(
    x,
    name = NULL,
    na_rm = TRUE,
    remove_outliers = FALSE,
    iqr_mult = 1.5,
    as_list = FALSE
  ) {
  checkmate::assert_numeric(x)
  checkmate::assert_string(name, null.ok = TRUE)
  checkmate::assert_flag(na_rm)
  checkmate::assert_flag(remove_outliers)
  checkmate::assert_number(iqr_mult, lower = 1)
  checkmate::assert_flag(as_list)

  if (isTRUE(remove_outliers)) {
    x <- x |> remove_outliers(method = "iqr", iqr_mult = iqr_mult)
  }

  out <- list(
    n = length(x),
    n_rm_na = length(x[!is.na(x)]),
    n_na = length(x[is.na(x)]),
    mean = mean(x, na.rm = na_rm),
    var = stats::var(x, na.rm = na_rm),
    sd = stats::sd(x, na.rm = na_rm),
    min = rutils:::clear_names(stats::quantile(x, 0, na.rm = na_rm)),
    q_1 = rutils:::clear_names(stats::quantile(x, 0.25, na.rm = na_rm)),
    median = rutils:::clear_names(stats::quantile(x, 0.5, na.rm = na_rm)),
    q_3 = rutils:::clear_names(stats::quantile(x, 0.75, na.rm = na_rm)),
    max = rutils:::clear_names(stats::quantile(x, 1, na.rm = na_rm)),
    iqr = stats::IQR(x, na.rm = na_rm),
    skewness = moments::skewness(x, na.rm = na_rm),
    kurtosis = moments::kurtosis(x, na.rm = na_rm)
  )

  if (!is.null(name)) out <- append(out, list(name = name), after = 0)
  
  if (isTRUE(as_list)) {
    out
  } else {
    out |> list_as_tibble()
  }
}
```

```{r}
#| code-fold: true

plot_qq <- function(
    x,
    text_size = NULL,
    na_rm = TRUE,
    print = TRUE
  ) {
  checkmate::assert_numeric(x)
  checkmate::assert_number(text_size, null.ok = TRUE)
  checkmate::assert_flag(na_rm)
  checkmate::assert_flag(print)

  if (isTRUE(na_rm)) x <- x |> rutils:::drop_na()

  plot <-
    dplyr::tibble(y = x) |>
    ggplot2::ggplot(ggplot2::aes(sample = y)) +
    ggplot2::stat_qq() +
    ggplot2::stat_qq_line(color = "red", linewidth = 1) +
    ggplot2::labs(
      x = "Theoretical quantiles (Std. normal)",
      y = "Sample quantiles"
    ) +
    ggplot2::theme(text = ggplot2::element_text(size = text_size))

  if (isTRUE(print)) print(plot)
  
  invisible(plot)
}
```

```{r}
#| code-fold: true

plot_hist <- function(
    x,
    name = "x",
    bins = 30,
    stat = "density",
    text_size = NULL,
    density_line = TRUE,
    na_rm = TRUE,
    print = TRUE
  ) {
  checkmate::assert_numeric(x)
  checkmate::assert_string(name)
  checkmate::assert_number(bins, lower = 1)
  checkmate::assert_choice(stat, c("count", "density"))
  checkmate::assert_number(text_size, null.ok = TRUE)
  checkmate::assert_flag(density_line)
  checkmate::assert_flag(na_rm)
  checkmate::assert_flag(print)

  if (isTRUE(na_rm)) x <- x |> rutils:::drop_na()
  y_lab <- ifelse(stat == "count", "Frequency", "Density")

  plot <-
    dplyr::tibble(y = x) |>
    ggplot2::ggplot(ggplot2::aes(x = y)) +
    ggplot2::geom_histogram(
      ggplot2::aes(y = ggplot2::after_stat(!!as.symbol(stat))),
      bins = bins, # Honor the `bins` argument instead of a hard-coded value.
      color = "white"
    ) +
    ggplot2::labs(x = name, y = y_lab) +
    ggplot2::theme(text = ggplot2::element_text(size = text_size))

  if (stat == "density" && isTRUE(density_line)) {
    plot <- plot + ggplot2::geom_density(color = "red", linewidth = 1)
  }

  if (isTRUE(print)) print(plot)
  
  invisible(plot)
}
```

```{r}
#| code-fold: true

plot_ggally <- function(
    data,
    cols = names(data),
    mapping = NULL,
    axis_labels = "none",
    na_rm = TRUE,
    text_size = NULL
  ) {
  checkmate::assert_tibble(data)
  checkmate::assert_character(cols)
  checkmate::assert_subset(cols, names(data))
  checkmate::assert_class(mapping, "uneval", null.ok = TRUE)
  checkmate::assert_choice(axis_labels, c("show", "internal", "none"))
  checkmate::assert_flag(na_rm)
  checkmate::assert_number(text_size, null.ok = TRUE)

  out <-
    data |>
    dplyr::select(dplyr::all_of(cols)) |>
    dplyr::mutate(
      dplyr::across(
        .cols = dplyr::where(hms::is_hms),
        .fns = ~ midday_trigger(.x)
      ),
      dplyr::across(
        .cols = dplyr::where(
          ~ !is.character(.x) && !is.factor(.x) && !is.numeric(.x)
        ),
        .fns = ~ as.numeric(.x)
      )
    )

  if (isTRUE(na_rm)) out <- out |> tidyr::drop_na(dplyr::all_of(cols))

  if (is.null(mapping)) {
    plot <-
      out |>
      GGally::ggpairs(
        lower = list(continuous = "smooth"),
        axisLabels = axis_labels
      ) 
  } else {
    plot <-
      out |>
      GGally::ggpairs(
        mapping = mapping,
        axisLabels = axis_labels
      ) +
      viridis::scale_color_viridis(
        begin = 0.25,
        end = 0.75,
        discrete = TRUE,
        option = "viridis"
      ) +
      viridis::scale_fill_viridis(
        begin = 0.25,
        end = 0.75,
        discrete = TRUE,
        option = "viridis"
      )
  }

  plot <- 
    plot +
    ggplot2::theme(text = ggplot2::element_text(size = text_size))

  print(plot)
  
  invisible(plot)
}
```

```{r}
#| code-fold: true

test_normality <- function(x,
                           name = "x",
                           remove_outliers = FALSE,
                           iqr_mult = 1.5,
                           log_transform = FALSE,
                           density_line = TRUE,
                           text_size = NULL,
                           print = TRUE) {
  checkmate::assert_numeric(x)
  checkmate::assert_string(name)
  checkmate::assert_flag(remove_outliers)
  checkmate::assert_number(iqr_mult, lower = 1)
  checkmate::assert_flag(log_transform)
  checkmate::assert_flag(density_line)
  checkmate::assert_number(text_size, null.ok = TRUE)
  checkmate::assert_flag(print)

  n <- x |> length()
  n_rm_na <- x |> rutils:::drop_na() |> length()

  if (isTRUE(remove_outliers)) {
    x <- x |> remove_outliers(method = "iqr", iqr_mult = iqr_mult)
  }

  if (isTRUE(log_transform)) {
    x <-
      x |>
      log() |>
      drop_inf()
  }

  if (n_rm_na >= 7) {
    ad <- x |> nortest::ad.test()

    cvm <-
      x |>
      nortest::cvm.test() |>
      rutils::shush()
  } else {
    ad <- NULL
    cvm <- NULL
  }

  bonett <- x |> moments::bonett.test()

  # See also `Rita::DPTest()` (just for Omnibus (K) tests).
  dagostino <-
    x |>
    fBasics::dagoTest() |>
    rutils::shush()

  jarque_bera <-
    rutils:::drop_na(x) |>
    tseries::jarque.bera.test()

  if (n_rm_na >= 4) {
    lillie_ks <- x |> nortest::lillie.test()
  } else {
    lillie_ks <- NULL
  }

  pearson <- x |> nortest::pearson.test()

  if (n_rm_na >= 5 && n_rm_na <= 5000) {
    sf <- x |> nortest::sf.test()
  } else {
    sf <- NULL
  }

  if (n_rm_na >= 3 && n_rm_na <= 3000) {
    shapiro <- x |> stats::shapiro.test()
  } else {
    shapiro <- NULL
  }

  qq_plot <- x |> plot_qq(text_size = text_size, print = FALSE)

  hist_plot <-
    x |>
    plot_hist(
      name = name,
      text_size = text_size,
      density_line = density_line,
      print = FALSE
      )

  grid_plot <- cowplot::plot_grid(hist_plot, qq_plot, ncol = 2, nrow = 1)

  out <- list(
    stats = stats_sum(
      x,
      name = name,
      na_rm = TRUE,
      remove_outliers = FALSE,
      as_list = TRUE
    ),
    params = list(
      name = name,
      remove_outliers = remove_outliers,
      log_transform = log_transform,
      density_line = density_line
    ),

    ad = ad,
    bonett = bonett,
    cvm = cvm,
    dagostino = dagostino,
    jarque_bera = jarque_bera,
    lillie_ks = lillie_ks,
    pearson = pearson,
    sf = sf,
    shapiro = shapiro,

    hist_plot = hist_plot,
    qq_plot = qq_plot,
    grid_plot = grid_plot
  )

  if (isTRUE(print)) print(grid_plot)

  invisible(out)
}
```

```{r}
#| code-fold: true

normality_sum <- function(
    x, 
    round = FALSE, 
    digits = 5, 
    only_p_value = FALSE, 
    ...
  ) {
  checkmate::assert_numeric(x)
  checkmate::assert_flag(round)
  checkmate::assert_number(digits)
  checkmate::assert_flag(only_p_value)

  stats <- test_normality(x, print = FALSE, ...)

  out <- dplyr::tibble(
    test = c(
      "Anderson-Darling",
      "Bonett-Seier",
      "Cramér-von Mises",
      "D'Agostino Omnibus Test",
      "D'Agostino Skewness Test",
      "D'Agostino Kurtosis Test",
      "Jarque–Bera",
      "Lilliefors (K-S)",
      "Pearson chi-square",
      "Shapiro-Francia",
      "Shapiro-Wilk"
    ),
    statistic_1 = c(
      stats$ad$statistic,
      stats$bonett$statistic[1],
      stats$cvm$statistic,
      attr(stats$dagostino, "test")$statistic[1],
      attr(stats$dagostino, "test")$statistic[2],
      attr(stats$dagostino, "test")$statistic[3],
      stats$jarque_bera$statistic,
      stats$lillie_ks$statistic,
      stats$pearson$statistic,
      # Order must match `test`: Shapiro-Francia first, then Shapiro-Wilk.
      ifelse(is.null(stats$sf), NA, stats$sf$statistic),
      ifelse(is.null(stats$shapiro), NA, stats$shapiro$statistic)
    ),
    statistic_2 = c(
      as.numeric(NA),
      stats$bonett$statistic[2],
      as.numeric(NA),
      as.numeric(NA),
      as.numeric(NA),
      as.numeric(NA),
      stats$jarque_bera$parameter,
      as.numeric(NA),
      as.numeric(NA),
      as.numeric(NA),
      as.numeric(NA)
    ),
    p_value = c(
      stats$ad$p.value,
      stats$bonett$p.value,
      stats$cvm$p.value,
      attr(stats$dagostino, "test")$p.value[1],
      attr(stats$dagostino, "test")$p.value[2],
      attr(stats$dagostino, "test")$p.value[3],
      stats$jarque_bera$p.value,
      stats$lillie_ks$p.value,
      stats$pearson$p.value,
      # Order must match `test`: Shapiro-Francia first, then Shapiro-Wilk.
      ifelse(is.null(stats$sf), NA, stats$sf$p.value),
      ifelse(is.null(stats$shapiro), NA, stats$shapiro$p.value)
    )
  )

  if (isTRUE(only_p_value)) out <- out |> dplyr::select(test, p_value)
  
  if (isTRUE(round)) {
    out |>
      dplyr::mutate(
        dplyr::across(
          .cols = dplyr::where(is.numeric),
          .fns = ~ round(.x, digits)
        ))
  } else {
    out
  }
}
```

## Preparing the data

**Assumption 1** is satisfied, as the predictors are known.

```{r}
#| code-fold: true

data <- 
  palmerpenguins::penguins |> 
  dplyr::filter(species == "Adelie") |>
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm) |>
  tidyr::drop_na()
```

::: {#tbl-data-prep-1}
```{r}
#| code-fold: true

data
```

Data frame with morphological measurements of Adélie penguins from the [Palmer Archipelago](https://en.wikipedia.org/wiki/Palmer_Archipelago).
:::

```{r}
report::report(data)
```

## Performing a power analysis

First, we will perform an *a posteriori* power analysis to determine the sample size needed to achieve a power ($1 - \beta$) of 0.8, given an $R^2$ of 0.5, a significance level ($\alpha$) of 0.05, and 2 predictors. It is an *a posteriori* analysis because we already have the data in hand. It is good practice to perform a power analysis before running the model to ensure that the sample size is adequate.

The results show that we need at least 14 observations for each variable to achieve the desired power. We have 151 observations, which is more than enough.

```{r}
#| code-fold: true

pre_pwr <- pwrss::pwrss.f.reg(
  r2 = 0.5, 
  k = 2,
  power = 0.80,
  alpha = 0.05
)
```

```{r}
#| code-fold: true

pwrss::power.f.test(
  ncp = pre_pwr$ncp,
  df1 = pre_pwr$df1,
  df2 = pre_pwr$df2,
  alpha = pre_pwr$parms$alpha,
  plot = TRUE
)
```
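The logic behind this calculation can be sketched in base R using the noncentral $F$ distribution. This is an illustration, not a reimplementation of `pwrss`: `reg_power()` is a hypothetical helper, and it assumes the noncentrality parameter $\lambda = f^2 \times n$ with $f^2 = R^2 / (1 - R^2)$, so its result may differ slightly from the package output if a different convention is used there.

```r
# Hypothetical helper: power of the overall F test in a linear regression
# with `k` predictors, `n` observations, and population R^2 = `r2`.
# Assumption: noncentrality parameter lambda = f^2 * n (Cohen's convention).
reg_power <- function(n, r2 = 0.5, k = 2, alpha = 0.05) {
  f2 <- r2 / (1 - r2)
  df1 <- k
  df2 <- n - k - 1
  f_crit <- qf(1 - alpha, df1, df2)

  # Probability of exceeding the critical value under the alternative.
  pf(f_crit, df1, df2, ncp = f2 * n, lower.tail = FALSE)
}

# Smallest sample size reaching the target power of 0.8.
candidate_n <- 4:200
power_by_n <- vapply(candidate_n, reg_power, numeric(1))
min_n <- candidate_n[which(power_by_n >= 0.8)[1]]

c(min_n = min_n, power = reg_power(min_n))
```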

Is the sample size greater than or equal to the required size?

```{r}
data |> 
  tidyr::drop_na() |> 
  nrow() |>
  magrittr::is_weakly_greater_than(pre_pwr$n)
```

## Checking distributions

The data show fairly normal distributions. The `bill_length_mm` and `bill_depth_mm` variables appear slightly skewed, while the `flipper_length_mm` variable is more symmetric.

:::: {.panel-tabset}
### Bill length (mm)

::: {#tbl-var-dist-stats-sum-bill_length_mm}
```{r}
#| code-fold: true

data |>
  dplyr::pull(bill_length_mm) |> 
  stats_sum(name = "Bill length (mm)")
```

Summary statistics for the `bill_length_mm` variable.
:::

::: {#fig-var-dist-hist-bill_length_mm}
```{r}
#| code-fold: true

data |> 
  dplyr::pull(bill_length_mm) |> 
  test_normality(name = "Bill length (mm)")
```

Histogram of the `bill_length_mm` variable with a kernel density estimate, along with a quantile-quantile (Q-Q) plot between the variable and the theoretical quantiles of the normal distribution.
:::

### Bill depth (mm)

::: {#tbl-var-dist-stats-sum-bill_depth_mm}
```{r}
#| code-fold: true

data |> 
  dplyr::pull(bill_depth_mm) |> 
  stats_sum(name = "Bill depth (mm)")
```

Summary statistics for the `bill_depth_mm` variable.
:::

::: {#fig-var-dist-hist-bill_depth_mm}
```{r}
#| code-fold: true

data |> 
  dplyr::pull(bill_depth_mm) |> 
  test_normality(name = "Bill depth (mm)")
```

Histogram of the `bill_depth_mm` variable with a kernel density estimate, along with a quantile-quantile (Q-Q) plot between the variable and the theoretical quantiles of the normal distribution.
:::

### Flipper length (mm)

::: {#tbl-var-dist-stats-sum-flipper_length_mm}
```{r}
#| code-fold: true

data |> 
  dplyr::pull(flipper_length_mm) |> 
  stats_sum(name = "Flipper length (mm)")
```

Summary statistics for the `flipper_length_mm` variable.
:::

::: {#fig-var-dist-hist-flipper_length_mm}
```{r}
#| code-fold: true

data |> 
  dplyr::pull(flipper_length_mm) |> 
  test_normality(name = "Flipper length (mm)")
```

Histogram of the `flipper_length_mm` variable with a kernel density estimate, along with a quantile-quantile (Q-Q) plot between the variable and the theoretical quantiles of the normal distribution.
:::
::::

## Checking correlations

Both `bill_length_mm` and `bill_depth_mm` show a statistically significant positive correlation with `flipper_length_mm`. No non-linear relationships are observed.

::: {#fig-correlations-correlation-matrix}
```{r}
#| code-fold: true

data |> 
  plot_ggally() |> 
  rutils::shush()
```

Correlation matrix of `bill_length_mm`, `bill_depth_mm`, and `flipper_length_mm` variables.
:::
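The significance of a single pairwise correlation, like those reported in the matrix, can be tested with `stats::cor.test()`. The sketch below uses simulated data and hypothetical variable names rather than the penguins data.

```r
# Simulated data only (not the penguins data).
set.seed(7)
n <- 150
predictor <- rnorm(n)
response <- 0.4 * predictor + rnorm(n)

# Pearson correlation with a two-sided test of H0: rho = 0.
cor_result <- cor.test(predictor, response, method = "pearson")

c(
  estimate = unname(cor_result$estimate),
  p_value = cor_result$p.value
)
```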

## Checking for outliers

A few minor outliers are present in the `bill_length_mm` and `flipper_length_mm` variables. We could remove these outliers, but they are not extreme and do not appear to be errors.

A few observations were flagged with Cook’s D values greater than the threshold. One of them (129th) was particularly influential, so it was removed from the dataset.

### Boxplots

```{r}
#| code-fold: true

colors <- gg_color_hue(3)
```

::: {#fig-outliers-1}
```{r}
#| code-fold: true

data |> 
  tidyr::pivot_longer(-flipper_length_mm) |>
  ggplot2::ggplot(ggplot2::aes(x = name, y = value, fill = name)) +
  ggplot2::geom_boxplot(
    outlier.colour = "red", 
    outlier.shape = 1,
    width = 0.75
  ) +
  ggplot2::geom_jitter(width = 0.3, alpha = 0.1, color = "black", size = 0.5) +
  ggplot2::labs(x = "Variable", y = "Value", fill = ggplot2::element_blank()) +
  ggplot2::scale_fill_manual(
    labels = c("Bill length (mm)", "Bill depth (mm)"),
    breaks = c("bill_length_mm", "bill_depth_mm"),
    values = gg_color_hue(3)[1:2]
  ) +
  ggplot2::coord_flip() +
  ggplot2::theme(
    axis.title.y = ggplot2::element_blank(),
    axis.text.y = ggplot2::element_blank(),
    axis.ticks.y = ggplot2::element_blank()
  )
```

Boxplots of the `bill_length_mm` and `bill_depth_mm` variables, with outliers highlighted in red and other data indicated by jittered points.
:::

::: {#fig-outliers-2}
```{r}
#| code-fold: true

data |> 
  tidyr::pivot_longer(flipper_length_mm) |>
  ggplot2::ggplot(ggplot2::aes(x = name, y = value, fill = name)) +
  ggplot2::geom_boxplot(
    outlier.colour = "red", 
    outlier.shape = 1,
    width = 0.5
  ) +
  ggplot2::geom_jitter(width = 0.2, alpha = 0.1, color = "black", size = 0.5) +
  ggplot2::scale_fill_manual(
    labels = "Flipper length (mm)",
    values = gg_color_hue(3) |> dplyr::last()
  ) +
  ggplot2::labs(x = "Variable", y = "Value", fill = ggplot2::element_blank()) +
  ggplot2::coord_flip() +
  ggplot2::theme(
    axis.title.y = ggplot2::element_blank(),
    axis.text.y = ggplot2::element_blank(),
    axis.ticks.y = ggplot2::element_blank()
  )
```

Boxplot of the `flipper_length_mm` variable, with outliers highlighted in red and other data indicated by jittered points.
:::

### Cook's distance

[Cook's distance](https://en.wikipedia.org/wiki/Cook%27s_distance) measures each observation's influence on the model's fitted values. It is considered one of the most representative metrics for assessing overall influence [@hair2019].

A common practice is to flag observations with a Cook's distance of 1.0 or greater. However, a threshold of $4 / (n - k - 1)$, where $n$ is the sample size and $k$ is the number of independent variables, is suggested as a more conservative measure in small samples or for use with larger datasets [@hair2019].

Learn more about Cook's D in: @cook1977; @cook1979.
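
As a rough illustration of what the statistic measures, the sketch below recomputes Cook's distance by hand from the residuals and hat values and compares the result with `stats::cooks.distance()`. It is only a sketch: it uses the built-in `mtcars` dataset and a hypothetical `toy_fit` model instead of the penguin data, so the flagged observations have no meaning for this analysis.

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

e <- stats::residuals(toy_fit)               # raw residuals
h <- stats::hatvalues(toy_fit)               # leverages (hat values)
p <- length(stats::coef(toy_fit))            # number of coefficients (k + 1)
s2 <- sum(e^2) / stats::df.residual(toy_fit) # residual variance estimate

# Cook's distance: squared residual scaled by p * s2, weighted by leverage.
cooks_manual <- (e^2 / (p * s2)) * (h / (1 - h)^2)

# Should agree with the built-in implementation.
all.equal(cooks_manual, stats::cooks.distance(toy_fit))

# Threshold 4 / (n - k - 1) with k = 2 predictors.
toy_cut_off <- 4 / (nrow(mtcars) - 2 - 1)
which(cooks_manual > toy_cut_off)
```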

```{r}
cooks_d_cut_off <- 4 / (nrow(data) - 2 - 1)

cooks_d_cut_off
```

```{r}
form <- formula(flipper_length_mm ~ bill_length_mm + bill_depth_mm)
```

```{r}
fit <- lm(form, data = data)
```

```{r}
#| code-fold: true

cooks_obs <- 
  fit |> 
  stats::cooks.distance() |>
  magrittr::is_greater_than(cooks_d_cut_off) |>
  which() |>
  `names<-`(NULL)

fit |>
  stats::cooks.distance() |>
  magrittr::extract(cooks_obs)
```

::: {#fig-outliers-3}
```{r}
#| code-fold: true

plot <- 
  fit |> 
  olsrr::ols_plot_cooksd_bar(type = 2, print_plot = FALSE)

# The following procedure changes the plot aesthetics.
q <- plot$plot + ggplot2::labs(title = ggplot2::element_blank())
q <- q |> ggplot2::ggplot_build()
q$data[[5]]$label <- ""

q |> ggplot2::ggplot_gtable() |> ggplotify::as.ggplot()
```

Cook's distance for each observation along with a threshold line at $4 / (n - k - 1)$.
:::

### Outlier removal

Outlier detection methods indicate which observations are unusual or influential, and it's our job to determine why certain observations stand out [@struck2024]. In this scenario, an [observational error](https://en.wikipedia.org/wiki/Observational_error) or a unique characteristic of the penguin species might be causing some distortions.

For practical reasons, I will remove the 129th observation for now. However, it's crucial to review the model assumptions before making this decision. Violating an assumption might cause many observations to be incorrectly labeled as outliers.

```{r}
data <- data |> dplyr::slice(-129)
```

## Fitting the model

```{r}
recipe <- 
  data |>
  recipes::recipe(form) |>
  rutils::shush()
```

```{r}
model <- 
  parsnip::linear_reg() |> 
  parsnip::set_engine("lm") |>
  parsnip::set_mode("regression")
```

```{r}
#| code-fold: true

workflow <- 
  workflows::workflow() |>
  workflows::add_recipe(recipe) |>
  workflows::add_model(model)
```

```{r}
fit <- workflow |> parsnip::fit(data)
```

::: {#tbl-model-fit-1}
```{r}
#| code-fold: true

fit |>
  broom::tidy() |> 
  janitor::adorn_rounding(5)
```

Output from the model fitting process showing the estimated coefficients, standard errors, test statistics, and p-values for the terms in the linear regression model.
:::

::: {#tbl-model-fit-2}
```{r}
#| code-fold: true

fit |> 
  broom::augment(data) |>
  janitor::adorn_rounding(5)
```

Model summary table displaying predictions and residuals, along with the variables used in the model.
:::

::: {#tbl-model-fit-3}
```{r}
#| code-fold: true

fit |> 
  broom::glance() |> 
  tidyr::pivot_longer(cols = dplyr::everything()) |>
  janitor::adorn_rounding(10)
```

Summary of model fit statistics showing key metrics including R-squared, adjusted R-squared, sigma, statistic, p-value, degrees of freedom, log-likelihood, AIC, BIC, and deviance.
:::

```{r}
fit_engine <- fit |> parsnip::extract_fit_engine()

fit_engine |> summary()
```

::: {#tbl-model-fit-4}
```{r}
#| code-fold: true

fit_engine |> parameters::standardize_parameters()
```

Standardized model parameters (coefficients) along with their ranges, based on a 95% confidence interval.
:::

```{r}
#| code-fold: true

# A jury-rigged solution to fix issues related to modeling using the pipe.

fit_engine_2 <- lm(form, data = data)
```

```{r}
report::report(fit_engine_2)
```

## Inspecting the model fit

### Predictions

In a multiple linear regression with two predictors, the model is fit by fitting a plane to the data points.
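
In generic notation (the symbols here are introduced only for illustration), with $x_{1}$ standing for bill length and $x_{2}$ for bill depth, the fitted plane can be written as:

$$
\hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \hat{\beta}_{2} x_{2}
$$

Ordinary least squares chooses the coefficients that minimize the sum of squared vertical distances between the observed points and this plane:

$$
(\hat{\beta}_{0}, \hat{\beta}_{1}, \hat{\beta}_{2}) = \underset{\beta_{0}, \beta_{1}, \beta_{2}}{\arg\min} \sum_{i = 1}^{n} \left( y_{i} - \beta_{0} - \beta_{1} x_{i1} - \beta_{2} x_{i2} \right)^{2}
$$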

```{r}
#| eval: false
#| include: false

# Source: https://stackoverflow.com/a/70979149/8258804

# To find the `theta` and `phi` angles, do the following:
# 
# * Install the `orientlib` package before doing this.
# 
# 1. Run the chunk below to get a viewport open.
# 2. Adjust the size of the viewport to the size of the Quarto chart rendering.
# 3. Run `user_matrix <- rgl::par3d()$userMatrix` to get the user matrix.
# 4. Run `zoom <- rgl::par3d()$zoom` to get the zoom.
# 5. Add `user_matrix` and `zoom` to the `rgl::view3d()` function.

# install.packages("orientlib")
# user_matrix <- rgl::par3d()$userMatrix
# zoom <- rgl::par3d()$zoom
```

::: {#fig-model-fit-comparison-1}
```{r}
#| code-fold: true

user_matrix <-
  dplyr::tribble(
    ~a,         ~b,         ~c,          ~d,
    0.6233152,  -0.7817951, -0.01657271, 0,
    0.1739255,  0.1179437,  0.97767037,  0,
    -0.7623830, -0.6122792, 0.20949011,  0,
    0,          0,          0,           1
  ) |>
  as.matrix() |>
  `colnames<-`(NULL)

fit_engine |>
  predict3d::predict3d(
    xlab = "Bill length (mm)",
    ylab = "Bill depth (mm)",
    zlab = "Flipper length (mm)",
    radius = 0.75,
    type = "s",
    color = "red",
    show.subtitle = FALSE,
    show.error = FALSE
  )

rgl::view3d(userMatrix = user_matrix, zoom = 0.9)

rgl::rglwidget(elementId = "1st") |> rutils::shush()
```

A 3D visualization of the fitted model: the plane represents the model, while the points represent the observed data. **Use the mouse to explore**.
:::

::: {#fig-model-fit}
```{r}
#| code-fold: true

limits <- 
  stats::predict(fit_engine, interval = "prediction") |>
  dplyr::as_tibble() |>
  rutils::shush()

fit |>
  broom::augment(data) |>
  dplyr::bind_cols(limits) |>
  ggplot2::ggplot(ggplot2::aes(flipper_length_mm, .pred)) +
  # ggplot2::geom_ribbon(
  #   mapping = ggplot2::aes(ymin = lwr, ymax = upr),
  #   alpha = 0.2
  # ) +
  ggplot2::geom_ribbon(
    mapping = ggplot2::aes(
      ymin = stats::predict(stats::loess(lwr ~ flipper_length_mm)),
      ymax = stats::predict(stats::loess(upr ~ flipper_length_mm)),
    ),
    alpha = 0.2
  ) +
  ggplot2::geom_smooth(
    mapping = ggplot2::aes(y = lwr),
    se = FALSE,
    method = "loess",
    formula = y ~ x,
    linetype = "dashed",
    linewidth = 0.2,
    color = "black"
  ) +
  ggplot2::geom_smooth(
    mapping = ggplot2::aes(y = upr),
    se = FALSE,
    method = "loess",
    formula = y ~ x,
    linetype = "dashed",
    linewidth = 0.2,
    color = "black"
  ) +
  ggplot2::geom_point() +
  ggplot2::geom_abline(intercept = 0, slope = 1, color = "red") +
  ggplot2::labs(
    x = "Observed", 
    y = "Predicted", 
    subtitle = latex2exp::TeX(
      paste0(
        lm_str_fun(fit_engine, digits = 3, latex2exp = FALSE), " | ",
        "$R^{2} = ", round(broom::glance(fit)$r.squared, 3), "$ | ",
        "$R^{2}_{adj} = ", round(broom::glance(fit)$adj.r.squared, 3), "$"
      )
    )
  )
```

Relation between observed and predicted values. The red line is a 45-degree line through the origin and represents a perfect fit. The shaded area depicts a smoothed version of the 95% [prediction interval](http://www.sthda.com/english/articles/40-regression-analysis/166-predict-in-r-model-predictions-and-confidence-intervals/).
:::
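
A prediction interval is wider than a confidence interval because it accounts for the residual variance of a new observation in addition to the uncertainty in the estimated mean. The sketch below shows the difference using the built-in `mtcars` data and a hypothetical `toy_fit` model, both of which are stand-ins and not part of this analysis.

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

# Interval for the mean response at each observed predictor combination.
conf <- stats::predict(toy_fit, interval = "confidence", level = 0.95)

# Interval for a single new response. R warns that predictions on the
# current data refer to future responses, so the warning is suppressed.
pred <- suppressWarnings(
  stats::predict(toy_fit, interval = "prediction", level = 0.95)
)

# Compare the widths of both intervals, observation by observation.
widths <- cbind(
  confidence = conf[, "upr"] - conf[, "lwr"],
  prediction = pred[, "upr"] - pred[, "lwr"]
)

head(widths)
```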

### Adjusted predictions

If we keep all predictors fixed except one, we can observe the regression line between that independent variable and the dependent variable. Here, I present different visualizations of the fitted model. This is important for understanding how each predictor is associated with the outcome.
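
A minimal sketch of this idea, assuming a hypothetical `toy_fit` model on the built-in `mtcars` data (not the helper functions used in this document): one predictor varies over a grid while the other is held at its mean, and the model is asked for predictions on that grid.

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

# Vary `wt` across its observed range while holding `hp` at its mean.
grid <- data.frame(
  wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50),
  hp = mean(mtcars$hp)
)

grid$adjusted <- stats::predict(toy_fit, newdata = grid)

# The adjusted prediction is a straight line whose slope is the `wt` coefficient.
slope <- (grid$adjusted[50] - grid$adjusted[1]) / (grid$wt[50] - grid$wt[1])

c(slope = slope, coefficient = stats::coef(toy_fit)[["wt"]])
```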

::: {#fig-model-fit-comparison-2}
```{r}
#| code-fold: true

fit |>
  broom::augment(data) |>
  ggplot2::ggplot(ggplot2::aes(bill_length_mm, flipper_length_mm)) +
  ggplot2::geom_point() +
  ggplot2::geom_line(
    ggplot2::aes(y = .pred, color = "Prediction"),
    linewidth = 0.5,
    alpha = 0.5
  ) +
  ggplot2::geom_function(
    ggplot2::aes(y = .pred, color = "Adjusted prediction"),
    fun = lm_fun(fit_engine, fix_all_but = 1, data = data),
    linewidth = 1
  ) +
  ggplot2::labs(
    x = "Bill length (mm)",
    y = "Flipper length (mm)",
    subtitle = lm_str_fun(fit_engine, fix_all_but = 1),
    color = ggplot2::element_blank()
  ) +
  ggplot2::scale_color_manual(
    values = c("Prediction" = "blue", "Adjusted prediction" = "red")
  )
```

Model prediction (blue line) and adjusted prediction (red line) plotted against a scatter plot of the dependent variable (`flipper_length_mm`) and one of the independent variables (`bill_length_mm`). The adjusted prediction is calculated by holding the `bill_depth_mm` variable constant at its mean value.
:::

::: {#fig-model-fit-comparison-3}
```{r}
#| code-fold: true

fit |>
  broom::augment(data) |>
  ggplot2::ggplot(ggplot2::aes(bill_depth_mm, flipper_length_mm)) +
  ggplot2::geom_point() +
  ggplot2::geom_line(
    ggplot2::aes(y = .pred, color = "Prediction"), 
    linewidth = 0.5,
    alpha = 0.5
  ) +
  ggplot2::geom_function(
    ggplot2::aes(y = .pred, color = "Adjusted prediction"),
    fun = lm_fun(fit_engine, fix_all_but = 2, data = data),
    linewidth = 1
  ) +
  ggplot2::labs(
    x = "Bill depth (mm)",
    y = "Flipper length (mm)",
    subtitle = lm_str_fun(fit_engine, fix_all_but = 2),
    color = ggplot2::element_blank()
  ) +
  ggplot2::scale_color_manual(
    values = c("Prediction" = "blue", "Adjusted prediction" = "red")
  )
```

Model prediction (blue line) and adjusted prediction (red line) plotted against a scatter plot of the dependent variable (`flipper_length_mm`) and one of the independent variables (`bill_depth_mm`). The adjusted prediction is calculated by holding the `bill_length_mm` variable constant at its mean value.
:::

<!-- See: https://cran.r-project.org/web/packages/predict3d/vignettes/predict3d.html -->

```{r}
#| eval: false
#| include: false

fit_engine |>
  predict3d::ggPredict(
    show.point = TRUE,
    se = FALSE, 
    alpha = 0.1,
    show.text = FALSE,
    xpos = 0.5,
    digits = 3,
    facet.modx = FALSE
  )
```

::: {#fig-model-fit-comparison-4}
```{r}
#| code-fold: true

fit_engine_2 |>
  ggeffects::predict_response(
    terms = c("bill_length_mm", "bill_depth_mm")
  ) |> 
  plot(show_data = TRUE, verbose = FALSE) +
  ggplot2::labs(
    title = ggplot2::element_blank(),
    x = "Bill length (mm)",
    y = "Flipper length (mm)",
    color = "Bill depth (mm)"
  ) +
  ggplot2::theme_gray()
```

Relationship between `flipper_length_mm` and `bill_length_mm`, with data points represented as dots. The three lines show model predictions for different values of the variable `bill_depth_mm`, with the shaded areas representing confidence intervals.
:::

### Posterior predictive checks

Posterior predictive checks are a Bayesian technique used to assess model fit by comparing observed data to data simulated from the [posterior predictive distribution](https://en.wikipedia.org/wiki/Posterior_predictive_distribution) (i.e., the distribution of potential unobserved values given the observed data). These checks help identify systematic discrepancies between the observed and simulated data, providing insight into whether the chosen model (or distributional family) is appropriate. Ideally, the model-predicted lines should closely match the observed data patterns.
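
For a model fitted by least squares, a rough frequentist analogue of this check can be sketched by simulating new responses from the fitted model and overlaying their densities on the observed one. The sketch below is an assumption-laden illustration using base R, the built-in `mtcars` data, and a hypothetical `toy_fit` model; it is not necessarily how the plot in this section is produced.

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

set.seed(123)

# One column per simulated response vector drawn from the fitted model.
sims <- stats::simulate(toy_fit, nsim = 50)

# Observed density in red, simulated densities in translucent black.
plot(
  stats::density(mtcars$mpg),
  col = "red", lwd = 2, main = "", xlab = "mpg"
)

for (i in seq_len(ncol(sims))) {
  lines(
    stats::density(sims[[i]]),
    col = grDevices::adjustcolor("black", alpha.f = 0.2)
  )
}
```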

::: {#fig-model-fit-comparison-5}
```{r}
#| code-fold: true

diag_sum_plots <- 
  fit_engine_2 |> 
  performance::check_model(
    panel = FALSE,
    colors = c("red", "black", "black")
  ) |>
  plot() |>
  rutils::shush()

diag_sum_plots$PP_CHECK +
  ggplot2::labs(
    title = ggplot2::element_blank(),
    subtitle = ggplot2::element_blank(),
    x = "Flipper length (mm)",
  ) +
  ggplot2::theme_gray()
```

Posterior predictive checks for the model. The red line represents the observed data, while the black lines represent the model-predicted data.
:::

## Performing model diagnostics

::: {.callout-warning}
Before using objective assumption tests (e.g., the Anderson–Darling test), note that they may not be advisable in some contexts. In larger samples, these tests can be overly sensitive to minor deviations, while in smaller samples they may fail to detect substantial deviations. They may also overlook visual patterns that a single metric does not capture. For these reasons, visual assessment of diagnostic plots is often preferable [@shatz2024; @kozak2018; @schucany2006]. For a straightforward critique of normality tests specifically, refer to [this](https://towardsdatascience.com/stop-testing-for-normality-dba96bb73f90) article by @greener2020.
:::

### Normality

**Assumption 2** is satisfied, as the residuals showed a normal distribution in 11 types of normality tests with different approaches (e.g., moments, regression/correlation, ECDFs).

#### Visual inspection

::: {#fig-model-residual-diag-normality-1}
```{r}
#| code-fold: true

fit_engine |>
  stats::residuals() |>
  test_normality(name = "Residuals")
```

Histogram of the model residuals with a kernel density estimate, along with a quantile-quantile (Q-Q) plot between the residuals and the theoretical quantiles of the normal distribution.
:::

::: {#fig-model-residual-diag-normality-2}
```{r}
#| code-fold: true

fit |> 
  broom::augment(data) |>
  dplyr::select(.resid) |>
  tidyr::pivot_longer(.resid) |>
  ggplot2::ggplot(ggplot2::aes(x = name, y = value, fill = name)) +
  ggplot2::geom_boxplot(
    outlier.colour = "red", 
    outlier.shape = 1,
    width = 0.5
  ) +
  ggplot2::geom_jitter(width = 0.2, alpha = 0.1, color = "black", size = 0.5) +
  ggplot2::scale_fill_manual(
    labels = "Residuals",
    values = gg_color_hue(1)
  ) +
  ggplot2::labs(x = "Variable", y = "Value", fill = ggplot2::element_blank()) +
  ggplot2::coord_flip() +
  ggplot2::theme(
    axis.title.y = ggplot2::element_blank(),
    axis.text.y = ggplot2::element_blank(),
    axis.ticks.y = ggplot2::element_blank()
  )
```

Boxplot of model residuals with outliers highlighted in red and other residuals indicated by jittered points.
:::

::: {#tbl-model-residual-diag-normality-1}
```{r}
#| code-fold: true

fit_engine |>
  stats::residuals() |>
  stats_sum(name = "Residuals")
```

Summary statistics of model residuals.
:::

#### Tests

It's important to note that the Kolmogorov-Smirnov and Pearson chi-square tests are included here just for reference, as many authors don't recommend using them when testing for normality [@dagostino1990]. Learn more about normality tests in @thode2002.

I also recommend checking the original papers for each test to understand their assumptions and limitations:

- [Anderson-Darling test](https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test): @anderson1952; @anderson1954.
- Bonett-Seier test: @bonett2002.
- [Cramér-von Mises test](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion): @cramer1928; @anderson1962.
- [D'Agostino test](https://en.wikipedia.org/wiki/D%27Agostino%27s_K-squared_test): @dagostino1971; @dagostino1973.
- [Jarque–Bera test](https://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test): @jarque1980; @bera1981; @jarque1987.
- [Lilliefors (K-S) test](https://en.wikipedia.org/wiki/Lilliefors_test): @smirnov1948; @kolmogorov1933; @massey1951; @lilliefors1967; @dallal1986.
- [Pearson chi-square test](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test): @pearson1900.
- [Shapiro-Francia test](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Francia_test): @shapiro1972.
- [Shapiro-Wilk test](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test): @shapiro1965.

$$
\begin{cases}
\text{H}_{0}: \text{The data is normally distributed} \\
\text{H}_{a}: \text{The data is not normally distributed}
\end{cases}
$$

::: {#tbl-model-residual-diag-normality-2}
```{r}
#| code-fold: true

fit_engine |>
  stats::residuals() |>
  normality_sum()
```

Summary of statistical tests conducted to assess the normality of the residuals.
:::

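The output below shows the correlation between the observed residuals and the residuals expected under normality. Values close to 1 are consistent with normally distributed residuals.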

```{r}
#| code-fold: false

fit_engine |> olsrr::ols_test_correlation()
```
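
The idea behind this correlation can be sketched in base R: sort the standardized residuals and correlate them with the normal quantiles expected at the same ranks. The plotting positions from `stats::ppoints()` are an assumption of this sketch and may differ from the ones used by `olsrr`, so the value need not match exactly. The `toy_fit` model and the `mtcars` data are stand-ins.

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

# Ordered standardized residuals.
observed <- sort(stats::rstandard(toy_fit))

# Normal quantiles expected at the same ranks.
expected <- stats::qnorm(stats::ppoints(length(observed)))

# This is the correlation displayed by a normal Q-Q plot.
normal_cor <- stats::cor(observed, expected)

normal_cor
```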

### Linearity

**Assumption 3** is satisfied, as the relationship between the variables is fairly linear.

::: {#fig-model-residual-diag-fitted-values-1}
```{r}
#| code-fold: true

fit |>
  broom::augment(data) |>
  ggplot2::ggplot(ggplot2::aes(.pred, .resid)) +
  ggplot2::geom_point() +
  ggplot2::geom_hline(
    yintercept = 0, 
    color = "black", 
    linewidth = 0.5,
    linetype = "dashed" 
  ) +
  ggplot2::geom_smooth(formula = y ~ x, method = "loess", color = "red") +
  ggplot2::labs(x = "Fitted values", y = "Residuals")
```

Residual plot showing the relationship between fitted values and residuals, with the dashed black line representing zero residuals, indicating an ideal model fit, and the red line indicating the smoothed conditional mean of residuals, with the shaded region representing the confidence interval of this estimate.
:::

::: {#fig-model-residual-diag-linearity-1}
```{r}
#| code-fold: true

plots <- fit_engine |> olsrr::ols_plot_resid_fit_spread(print_plot = FALSE)

for (i in seq_along(plots)) {
  q <- plots[[i]] + ggplot2::labs(title = ggplot2::element_blank())
  
  q <- q |> ggplot2::ggplot_build()
  q$data[[1]]$colour <- "red"
  q$plot$layers[[1]]$constructor$color <- "red"
  
  plots[[i]] <- q |> ggplot2::ggplot_gtable() |> ggplotify::as.ggplot()
}
  
cowplot::plot_grid(plots$fm_plot, plots$rsd_plot, ncol = 2, nrow = 1)
```

Residual fit spread plots to detect non-linearity, influential observations, and outliers. The side-by-side plots show the centered fit and residuals, illustrating the variation explained by the model and what remains in the residuals. Inappropriately specified models often exhibit greater spread in the residuals than in the centered fit. "Proportion Less" indicates the cumulative distribution function, representing the proportion of observations below a specific value, facilitating an assessment of model performance.
:::

[Ramsey's RESET test](https://en.wikipedia.org/wiki/Ramsey_RESET_test) indicates that the model has no omitted variables. This test examines whether non-linear combinations of the fitted values help explain the response variable.

Learn more about Ramsey's RESET test in: @ramsey1969.

$$
\begin{cases}
\text{H}_{0}: \text{The model has no omitted variables} \\
\text{H}_{a}: \text{The model has omitted variables}
\end{cases}
$$

```{r}
fit_engine |> lmtest::resettest(power = 2:3)
```

```{r}
fit_engine |> lmtest::resettest(type = "regressor")
```
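
The logic of the fitted-values version of the test can be sketched by hand: add powers of the fitted values to the model and use an F-test to check whether they improve the fit. The example below is an illustration on the built-in `mtcars` data with a hypothetical `toy_fit` model.

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

# Add the squared and cubed fitted values as extra regressors.
toy_data <- transform(
  mtcars,
  fitted_2 = stats::fitted(toy_fit)^2,
  fitted_3 = stats::fitted(toy_fit)^3
)

toy_aug <- stats::lm(mpg ~ wt + hp + fitted_2 + fitted_3, data = toy_data)

# F-test comparing the original and augmented models. A small p-value
# suggests that the linear specification misses some structure.
reset_by_hand <- stats::anova(toy_fit, toy_aug)

reset_by_hand
```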

### Homoscedasticity (common variance)

**Assumption 4** is satisfied, as the residuals exhibit approximately constant variance. While some heteroscedasticity is present, the Breusch-Pagan test (not studentized) indicates that it is not severe.

When comparing the spread of the square root of the absolute standardized residuals ($\sqrt{|\text{Standardized Residuals}|}$) against the fitted values and each predictor, we can observe that it is fairly constant across the range of values. This suggests that the residuals have a constant variance.

#### Visual inspection

```{r}
#| eval: false
#| include: false

# Based on:
# https://sscc.wisc.edu/sscc/pubs/RegDiag-R/homoscedasticity.html#:~:text=We%20must%20plot%20the%20residuals%20against%20the%20fitted%20values%20and%20against%20each%20of%20the%20predictors.
```

::: {#fig-model-diag-homoscedasticity-1}
```{r}
#| code-fold: true

fit |>
  stats::predict(data) |>
  dplyr::mutate(
    .sd_resid = 
      fit_engine |>
      stats::rstandard() |> 
      abs() |>
      sqrt()
  ) |>
  ggplot2::ggplot(ggplot2::aes(.pred, .sd_resid)) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth(formula = y ~ x, method = "loess", color = "red") +
  ggplot2::labs(
    x = "Fitted values", 
    y = latex2exp::TeX("$\\sqrt{|Standardized \\ Residuals|}$")
  )
```

Relation between the fitted values of the model and its standardized residuals.
:::

::: {#fig-model-diag-homoscedasticity-2}
```{r}
#| code-fold: true

fit |>
  stats::predict(data) |>
  dplyr::mutate(
    .sd_resid = 
      fit_engine |>
      stats::rstandard() |> 
      abs() |>
      sqrt()
  ) |>
  dplyr::bind_cols(data) |>
  ggplot2::ggplot(ggplot2::aes(bill_length_mm, .sd_resid)) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth(formula = y ~ x, method = "loess", color = "red") +
  ggplot2::labs(
    x = "Bill length (mm)", 
    y = latex2exp::TeX("$\\sqrt{|Standardized \\ Residuals|}$")
  )
```

Relation between `bill_length_mm` and the model standardized residuals.
:::

::: {#fig-model-diag-homoscedasticity-3}
```{r}
#| code-fold: true

fit |>
  stats::predict(data) |>
  dplyr::mutate(
    .sd_resid = 
      fit_engine |>
      stats::rstandard() |> 
      abs() |>
      sqrt()
  ) |>
  dplyr::bind_cols(data) |>
  ggplot2::ggplot(ggplot2::aes(bill_depth_mm, .sd_resid)) +
  ggplot2::geom_point() +
  ggplot2::geom_smooth(formula = y ~ x, method = "loess", color = "red") +
  ggplot2::labs(
    x = "Bill depth (mm)", 
    y = latex2exp::TeX("$\\sqrt{|Standardized \\ Residuals|}$")
  )
```

Relation between `bill_depth_mm` and the model standardized residuals.
:::

#### Breusch-Pagan test

The [Breusch-Pagan test](https://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test) indicates that the residuals exhibit constant variance.

Learn more about the Breusch-Pagan test in: @breusch1979 and @koenker1981.

$$
\begin{cases}
\text{H}_{0}: \text{The variance is constant} \\
\text{H}_{a}: \text{The variance is not constant}
\end{cases}
$$

```{r}
fit_engine_2 |> performance::check_heteroscedasticity()
```

```{r}
# Using the studentized modification of Koenker.
fit_engine |> lmtest::bptest(studentize = TRUE)
```

```{r}
fit_engine |> lmtest::bptest(studentize = FALSE)
```

```{r}
# Using the studentized modification of Koenker.
fit_engine |> skedastic::breusch_pagan(koenker = TRUE)
```

```{r}
fit_engine |> skedastic::breusch_pagan(koenker = FALSE)
```

```{r}
lm(form, data = data) |> car::ncvTest()
```

```{r}
fit_engine |> olsrr::ols_test_breusch_pagan()
```
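
To make the mechanics of the test more concrete, the studentized (Koenker) version can be sketched as an auxiliary regression: regress the squared residuals on the predictors and multiply the resulting R-squared by the sample size. The sketch uses the built-in `mtcars` data and a hypothetical `toy_fit` model as stand-ins.

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression of the squared residuals on the same predictors.
toy_data <- transform(mtcars, sq_resid = stats::residuals(toy_fit)^2)
aux_fit <- stats::lm(sq_resid ~ wt + hp, data = toy_data)

n <- nrow(toy_data)
k <- 2 # number of predictors in the auxiliary regression

# Under the null hypothesis, n * R^2 is approximately chi-squared with
# k degrees of freedom.
bp_stat <- n * summary(aux_fit)$r.squared
bp_p_value <- stats::pchisq(bp_stat, df = k, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p_value)
```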

#### White's test

[White's test](https://en.wikipedia.org/wiki/White_test) is a general test for heteroscedasticity. It is a generalization of the Breusch-Pagan test and is more flexible in terms of the types of heteroscedasticity it can detect. It has the same null hypothesis as the Breusch-Pagan test.

Like the Breusch-Pagan test, the results of White's test indicate that the residuals exhibit constant variance.

Learn more about White's test in: @white1980.

```{r}
fit_engine |> skedastic::white()
```
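
The generalization can be sketched as the same kind of auxiliary regression, now including the squares and the cross-product of the predictors. Whether a particular implementation includes the cross-product term depends on its options, so this sketch (built on the `mtcars` data and a hypothetical `toy_fit` model) should not be expected to reproduce any specific package output.

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

toy_data <- transform(mtcars, sq_resid = stats::residuals(toy_fit)^2)

# Auxiliary regression with levels, squares, and the cross-product.
aux_fit <- stats::lm(
  sq_resid ~ wt + hp + I(wt^2) + I(hp^2) + wt:hp,
  data = toy_data
)

# Under the null hypothesis, n * R^2 is approximately chi-squared with
# as many degrees of freedom as auxiliary regressors (here, 5).
white_df <- length(stats::coef(aux_fit)) - 1
white_stat <- nrow(toy_data) * summary(aux_fit)$r.squared
white_p_value <- stats::pchisq(white_stat, df = white_df, lower.tail = FALSE)

c(statistic = white_stat, df = white_df, p_value = white_p_value)
```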

### Independence

**Assumption 5** is satisfied. Although the residuals show some autocorrelation, they fall within the acceptable range of the Durbin–Watson statistic (1.5 to 2.5). It's also important to note that the observations for each predicted value are not related to any other prediction; in other words, they are not grouped or sequenced by any variable (by design) (see @hair2019[p. 291] for more information).

Many authors don't consider autocorrelation tests for linear regression models, as they are more relevant for time series data. However, I include them here just for reference.

#### Visual inspection

::: {#fig-model-diag-independence-1}
```{r}
#| code-fold: true

fit_engine |> 
  residuals() |>
  forecast::ggtsdisplay(lag.max = 30)
```

Time series plot of the residuals along with its AutoCorrelation Function (ACF) and Partial AutoCorrelation Function (PACF).
:::

#### Correlations

@tbl-model-residual-diag-independence-1 shows the relative importance of independent variables in determining the response variable. It highlights how much each variable uniquely contributes to the R-squared value, beyond what is explained by the other predictors.

::: {#tbl-model-residual-diag-independence-1}
```{r}
#| code-fold: true

fit_engine |> olsrr::ols_correlations()
```

Correlations between the dependent variable and the independent variables, reported as zero-order, part, and partial correlations. The zero-order correlation is the Pearson correlation coefficient between the dependent variable and an independent variable. The squared part (semipartial) correlation indicates how much the R-squared would decrease if that variable were removed from the model, while the squared partial correlation is the proportion of the variance in the response left unexplained by the other predictors that is explained by that variable.
:::

#### Newey-West estimator

The [Newey-West estimator](https://en.wikipedia.org/wiki/Newey%E2%80%93West_estimator) is a method used to estimate the [covariance matrix](https://en.wikipedia.org/wiki/Covariance_matrix) of the coefficients in a regression model when the residuals are autocorrelated.

Learn more about the Newey-West estimator in: @newey1987 and @newey1994.

```{r}
fit_engine |> sandwich::NeweyWest()
```

The heteroscedasticity and autocorrelation consistent (HAC) estimation of the covariance matrix of the coefficient estimates can also be computed in other ways. The HAC estimator below is an implementation made by @zeileis2004.

```{r}
fit_engine |> sandwich::vcovHAC()
```

#### Durbin-Watson test

The [Durbin-Watson test](https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic) is a statistical test used to detect the presence of autocorrelation at lag 1 in the residuals from a regression analysis. The test statistic ranges from 0 to 4, with a value of 2 indicating no autocorrelation. Values less than 2 indicate positive autocorrelation, while values greater than 2 indicate negative autocorrelation [@fox2016].

A common rule of thumb in the statistical community is that a Durbin-Watson statistic between 1.5 and 2.5 suggests little to no autocorrelation.

Learn more about the Durbin-Watson test in: @durbin1950; @durbin1951; and @durbin1971.

$$
\begin{cases}
\text{H}_{0}: \text{Autocorrelation of the disturbances is 0} \\
\text{H}_{a}: \text{Autocorrelation of the disturbances is not equal to 0}
\end{cases}
$$

```{r}
lmtest::dwtest(fit_engine)
```

```{r}
car::durbinWatsonTest(fit_engine)
```
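
The statistic itself is simple to compute by hand: it is the sum of squared successive differences of the residuals divided by the sum of squared residuals, which is approximately $2(1 - r_{1})$, where $r_{1}$ is the lag-1 autocorrelation. A minimal sketch, using the built-in `mtcars` data and a hypothetical `toy_fit` model as stand-ins:

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

e <- stats::residuals(toy_fit)

# Durbin-Watson statistic.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals and the usual approximation.
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

c(dw = dw, approximation = 2 * (1 - r1))
```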

#### Ljung-Box test

The Ljung–Box test is a statistical test used to determine whether any autocorrelations within a time series are significantly different from zero. Rather than testing randomness at individual lags, it assesses the "overall" randomness across multiple lags.

Learn more about the [Ljung-Box test](https://en.wikipedia.org/wiki/Ljung%E2%80%93Box_test) in: @box1970 and @ljung1978.

$$
\begin{cases}
\text{H}_{0}: \text{Residuals are independently distributed} \\
\text{H}_{a}: \text{Residuals are not independently distributed}
\end{cases}
$$

```{r}
fit_engine |>
  stats::residuals() |>
  stats::Box.test(type = "Ljung-Box", lag = 10)
```
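
The test statistic pools the squared autocorrelations of the first $h$ lags, weighting each by the number of observations available at that lag. The sketch below recomputes it by hand and compares it with `stats::Box.test()`, using the built-in `mtcars` data and a hypothetical `toy_fit` model as stand-ins.

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

e <- stats::residuals(toy_fit)
n <- length(e)
h <- 10 # number of lags tested

# Sample autocorrelations at lags 1 to h (dropping lag 0).
rho <- stats::acf(e, lag.max = h, plot = FALSE)$acf[-1]

# Ljung-Box statistic: Q = n (n + 2) * sum(rho_k^2 / (n - k)).
q_stat <- n * (n + 2) * sum(rho^2 / (n - seq_len(h)))
q_p_value <- stats::pchisq(q_stat, df = h, lower.tail = FALSE)

# Reference value from the built-in implementation.
box <- stats::Box.test(e, type = "Ljung-Box", lag = h)

c(manual = q_stat, built_in = unname(box$statistic))
```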

### Collinearity/Multicollinearity

No high degree of collinearity was observed among the independent variables.

#### Variance Inflation Factor (VIF)

The [Variance Inflation Factor (VIF)](https://en.wikipedia.org/wiki/Variance_inflation_factor) indicates the effect of other independent variables on the standard error of a regression coefficient. The VIF is directly related to the tolerance value ($\text{VIF}_{i} = 1/\text{TOL}_{i}$). High VIF values (larger than ~5 [@struck2024]) suggest significant collinearity or multicollinearity among the independent variables [@hair2019, p. 265].
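
As an illustration of this relationship, the tolerance of a predictor is $1 - R^{2}$ from regressing that predictor on all the others, and the VIF is its reciprocal. The sketch below computes both in base R on the built-in `mtcars` data; the choice of predictors is arbitrary and only serves the example.

```r
# Illustrative only: the `mtcars` predictors are stand-ins for the penguin data.
predictors <- c("wt", "hp", "disp")

# Regress each predictor on the remaining ones and keep the R-squared.
r_squared <- sapply(predictors, function(p) {
  aux_formula <- stats::reformulate(setdiff(predictors, p), response = p)
  summary(stats::lm(aux_formula, data = mtcars))$r.squared
})

tolerance <- 1 - r_squared
vif <- 1 / tolerance

data.frame(tolerance = tolerance, vif = vif)
```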

::: {#fig-model-diag-colinearity-1}
```{r}
#| code-fold: true

diag_sum_plots <- 
  fit_engine_2 |> 
  performance::check_model(panel = FALSE) |>
  plot() |>
  rutils::shush()

diag_sum_plots$VIF + 
  ggplot2::labs(
    title = ggplot2::element_blank(),
    subtitle = ggplot2::element_blank()
  ) +
  ggplot2::theme(
    legend.position = "right",
    axis.title = ggplot2::element_text(size = 11, colour = "black"),
    axis.text = ggplot2::element_text(colour = "gray25"),
    axis.text.y = ggplot2::element_text(size = 9),
    legend.text = ggplot2::element_text(colour = "black")
  )
```

Variance Inflation Factors (VIF) for each predictor variable. VIFs below 5 are considered acceptable. Between 5 and 10, the variable should be examined. Above 10, the variable must be considered highly collinear.
:::

::: {#tbl-model-residual-diag-colinearity-1}
```{r}
#| code-fold: true

fit_engine |> olsrr::ols_vif_tol()
```

Variance Inflation Factors (VIF) and tolerance values for each predictor variable.
:::

::: {#tbl-model-residual-diag-colinearity-2}
```{r}
#| code-fold: true

fit_engine_2 |> performance::check_collinearity()
```

Variance Inflation Factors (VIF) and tolerance values for each predictor variable, as computed by `performance::check_collinearity()`.
:::

#### Condition Index

The [condition index](https://en.wikipedia.org/wiki/Condition_number) is a measure of multicollinearity in a regression model. It is based on the [eigenvalues](https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors) of the correlation matrix of the predictors. A condition index of 30 or higher is generally considered indicative of significant collinearity [@belsley2004, pp. 112-114].

::: {#tbl-model-residual-diag-colinearity-3}
```{r}
#| code-fold: true

fit_engine |> olsrr::ols_eigen_cindex()
```

Condition indexes and eigenvalues for each predictor variable.
:::
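
One common way to obtain condition indexes, sketched below, is to scale each column of the model matrix (intercept included) to unit length and divide the largest singular value by each singular value. This follows the column-scaling approach usually attributed to Belsley and is an assumption of the sketch; it is shown on the built-in `mtcars` data with a hypothetical `toy_fit` model and is not guaranteed to match the output of any given package.

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

x <- stats::model.matrix(toy_fit)

# Scale every column (including the intercept) to unit Euclidean length.
x_scaled <- sweep(x, 2, sqrt(colSums(x^2)), "/")

# Singular values, returned in decreasing order.
d <- svd(x_scaled)$d

# Condition indexes: largest singular value over each singular value.
cond_index <- max(d) / d

data.frame(eigenvalue = d^2, condition_index = cond_index)
```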

### Measures of influence

In this section, we will check several measures of influence that can be used to assess the impact of individual observations on the model estimates.

But first, let's define some terms:

Leverage points
: Leverage is a measure of the distance between individual values of a predictor and other values of the predictor. In other words, a point with high leverage has an x-value far away from the other x-values. Points with high leverage have the potential to influence the model estimates [@struck2024; @hair2019, p. 262; @nahhas2024].

Influence points
: Influence is a measure of how much an observation affects the model estimates. If an observation with large influence were removed from the dataset, we would expect a large change in the predictive equation [@struck2024; @nahhas2024].

#### Standardized residuals

Standardized residuals are a rescaling of the residual to a common basis by dividing each residual by the standard deviation of the residuals [@hair2019, p. 264].
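
In R, `stats::rstandard()` additionally adjusts each residual for its leverage: the residual is divided by $s \sqrt{1 - h_{ii}}$, where $s$ is the residual standard error and $h_{ii}$ is the hat value. The sketch below checks this on the built-in `mtcars` data with a hypothetical `toy_fit` model, both stand-ins for the model used here.

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

e <- stats::residuals(toy_fit)  # raw residuals
h <- stats::hatvalues(toy_fit)  # leverages
s <- summary(toy_fit)$sigma     # residual standard error

# Standardized residuals adjusted for leverage.
std_manual <- e / (s * sqrt(1 - h))

all.equal(std_manual, stats::rstandard(toy_fit))
```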

```{r}
fit_engine |> stats::rstandard() |> head()
```

::: {#fig-model-diag-influence-1}
```{r}
#| code-fold: true

dplyr::tibble(
  x = seq_len(nrow(data)),
  std = stats::rstandard(fit_engine)
) |>
  ggplot2::ggplot(
    ggplot2::aes(x = x, y = std, ymin = 0, ymax = std)
  ) +
  ggplot2::geom_linerange(color = "blue") +
  ggplot2::geom_hline(yintercept = 2, color = "black") +
  ggplot2::geom_hline(yintercept = -2, color = "black") +
  ggplot2::geom_hline(yintercept = 3, color = "red") +
  ggplot2::geom_hline(yintercept = -3, color = "red") +
  ggplot2::scale_y_continuous(breaks = seq(-3, 3)) +
  ggplot2::labs(
    x = "Observation",
    y = "Standardized residual"
  )
```

Standardized residuals for each observation.
:::

```{r}
#| eval: false
#| include: false

fit_engine |> olsrr::ols_plot_resid_stand()
```

#### Studentized residuals

[Studentized residuals](https://en.wikipedia.org/wiki/Studentized_residual) are a commonly used variant of the standardized residual. They differ from other methods in how the standard deviation used in the standardization is calculated. To minimize the effect of any observation on the standardization process, the standard deviation of the residual for observation $i$ is computed from regression estimates that omit the $i$th observation [@hair2019, p. 264].
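
The leave-one-out idea can be sketched with `stats::influence()`, which returns the residual standard error obtained when each observation is dropped in turn. The example uses the built-in `mtcars` data and a hypothetical `toy_fit` model as stand-ins.

```r
# Illustrative only: `toy_fit` and `mtcars` are stand-ins, not the penguin model.
toy_fit <- stats::lm(mpg ~ wt + hp, data = mtcars)

infl <- stats::influence(toy_fit)

# `infl$sigma[i]` is the residual standard error of the model refitted
# without observation i. Check this explicitly for the first observation.
sigma_without_1 <- summary(stats::lm(mpg ~ wt + hp, data = mtcars[-1, ]))$sigma
all.equal(sigma_without_1, unname(infl$sigma[1]))

# Studentized residuals: raw residual over the leave-one-out scale.
stud_manual <- stats::residuals(toy_fit) / (infl$sigma * sqrt(1 - infl$hat))

all.equal(stud_manual, stats::rstudent(toy_fit))
```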

```{r}
fit_engine |> stats::rstudent() |> head()
```

::: {#fig-model-diag-influence-2}
```{r}
#| code-fold: true

dplyr::tibble(
  x = seq_len(nrow(data)),
  std = stats::rstudent(fit_engine)
) |>
  ggplot2::ggplot(
    ggplot2::aes(x = x, y = std, ymin = 0, ymax = std)
  ) +
  ggplot2::geom_linerange(color = "blue") +
  ggplot2::geom_hline(yintercept = 2, color = "black") +
  ggplot2::geom_hline(yintercept = -2, color = "black") +
  ggplot2::geom_hline(yintercept = 3, color = "red") +
  ggplot2::geom_hline(yintercept = -3, color = "red") +
  ggplot2::scale_y_continuous(breaks = seq(-3, 3)) +
  ggplot2::labs(
    x = "Observation",
    y = "Studentized residual"
  )
```

Studentized residuals for each observation.
:::

::: {#fig-model-diag-influence-3}
```{r}
#| code-fold: true

fit |> 
  broom::augment(data) |>
  dplyr::mutate(
    std = stats::rstudent(fit_engine)
  ) |>
  ggplot2::ggplot(ggplot2::aes(.pred, std)) +
  ggplot2::geom_point(color = "blue") +
  ggplot2::geom_hline(yintercept = 2, color = "black") +
  ggplot2::geom_hline(yintercept = -2, color = "black") +
  ggplot2::geom_hline(yintercept = 3, color = "red") +
  ggplot2::geom_hline(yintercept = -3, color = "red") +
  ggplot2::scale_y_continuous(breaks = seq(-3, 3)) +
  ggplot2::labs(
    x = "Predicted value",
    y = "Studentized residual"
  )
```

Relation between studentized residuals and fitted values.
:::

::: {#fig-model-diag-influence-4}
```{r}
#| code-fold: true

plot <- 
  fit_engine |> 
  olsrr::ols_plot_resid_lev(threshold = 2, print_plot = FALSE)

plot$plot +
  ggplot2::labs(
    title = ggplot2::element_blank(),
    y = "Studentized residual"
  )
```

Relation between studentized residuals and their leverage points.
:::

```{r}
#| eval: false
#| include: false

fit_engine |> olsrr::ols_plot_resid_stud()
```

```{r}
#| eval: false
#| include: false

fit_engine |> 
  car::influenceIndexPlot(
    vars = "Studentized",
    id = FALSE,
    main = NULL
  )
```

```{r}
#| eval: false
#| include: false

fit_engine |> olsrr::ols_plot_resid_stud_fit()
```

#### Hat values

The hat value indicates how distinct an observation’s predictor values are from those of other observations. Observations with high hat values have high leverage and may be, though not necessarily, influential. There is no fixed threshold for what constitutes a “large” hat value; instead, the focus must be on observations with hat values significantly higher than the rest [@nahhas2024; @hair2019, p. 261].
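
For intuition, hat values are the diagonal of the hat matrix $H = X (X^{\top} X)^{-1} X^{\top}$, where $X$ is the model matrix. The chunk below is a small sketch of that computation on a hypothetical stand-in model (`toy_fit`, fitted to the built-in `mtcars` dataset); it is not part of this analysis.

```{r}
# Stand-in model on a built-in dataset (assumption: not the document's fit).
toy_fit <- stats::lm(mpg ~ wt + hp, data = datasets::mtcars)

# Hat matrix built directly from the model matrix.
toy_x <- stats::model.matrix(toy_fit)
toy_hat_matrix <- toy_x %*% solve(crossprod(toy_x)) %*% t(toy_x)
toy_hat <- diag(toy_hat_matrix)

# Should agree with the built-in function.
all.equal(unname(toy_hat), unname(stats::hatvalues(toy_fit)))

# The hat values sum to the number of estimated coefficients.
sum(toy_hat)

# Observations with the largest leverage.
toy_hat |> sort(decreasing = TRUE) |> head(3)
```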

```{r}
fit_engine |> stats::hatvalues() |> head()
```

::: {#fig-model-diag-influence-5}
```{r}
#| code-fold: true

dplyr::tibble(
  x = seq_len(nrow(data)),
  hat = stats::hatvalues(fit_engine)
) |>
  ggplot2::ggplot(
    ggplot2::aes(x = x, y = hat, ymin = 0, ymax = hat)
  ) +
  ggplot2::geom_linerange(color = "blue") +
  ggplot2::labs(
    x = "Observation",
    y = "Hat value"
  )
```

Hat values for each observation.
:::

```{r}
#| eval: false
#| include: false

fit_engine |> 
  car::influenceIndexPlot(
    vars = "hat",
    id = FALSE,
    main = NULL
  )
```

#### Cook's distance

[Cook's D](https://en.wikipedia.org/wiki/Cook%27s_distance) measures each observation's influence on the model's fitted values. It is considered one of the most representative metrics for assessing overall influence [@hair2019].

A common practice is to flag observations with a Cook's distance of 1.0 or greater. However, a threshold of $4 / (n - k - 1)$, where $n$ is the sample size and $k$ is the number of independent variables, is suggested as a more conservative measure in small samples or for use with larger datasets [@hair2019].

Learn more about Cook's D in: @cook1977; @cook1979.
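
The two cutoffs above can be applied as in the sketch below. It also rebuilds Cook's distance from the standardized residuals $r_{i}$ and hat values $h_{ii}$ as $r_{i}^{2} h_{ii} / [(k + 1)(1 - h_{ii})]$, which is how `stats::cooks.distance()` computes it for linear models. The model `toy_fit` (built-in `mtcars` dataset) and all `toy_*` names are illustrative assumptions, not objects from this analysis.

```{r}
# Stand-in model on a built-in dataset (assumption: not the document's fit).
toy_fit <- stats::lm(mpg ~ wt + hp, data = datasets::mtcars)

toy_n <- stats::nobs(toy_fit)                # Sample size
toy_k <- length(stats::coef(toy_fit)) - 1    # Number of predictors

toy_cook <- stats::cooks.distance(toy_fit)

# Cook's distance rebuilt from standardized residuals and leverage.
toy_r <- stats::rstandard(toy_fit)
toy_h <- stats::hatvalues(toy_fit)
toy_cook_manual <- toy_r^2 * toy_h / ((toy_k + 1) * (1 - toy_h))
all.equal(toy_cook_manual, toy_cook)

# Observations flagged by each cutoff.
toy_threshold <- 4 / (toy_n - toy_k - 1)
toy_cook[toy_cook > toy_threshold]
toy_cook[toy_cook >= 1]
```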

```{r}
fit_engine |> stats::cooks.distance() |> head()
```

::: {#fig-model-diag-influence-6}
```{r}
#| code-fold: true

plot <- 
  fit_engine |> 
  olsrr::ols_plot_cooksd_bar(type = 2, print_plot = FALSE)

# The following procedure changes the plot aesthetics.
q <- plot$plot + ggplot2::labs(title = ggplot2::element_blank())
q <- q |> ggplot2::ggplot_build()
q$data[[5]]$label <- ""

q |> ggplot2::ggplot_gtable() |> ggplotify::as.ggplot()
```

Cook's distance for each observation along with a threshold line at $4 / (n - k - 1)$.
:::

::: {#fig-model-diag-influence-7}
```{r}
#| code-fold: true

diag_sum_plots <- 
  fit_engine_2 |> 
  performance::check_model(
    panel = FALSE,
    colors = c("blue", "black", "black")
  ) |>
  plot() |>
  rutils::shush()

plot <- 
  diag_sum_plots$OUTLIERS +
  ggplot2::labs(
    title = ggplot2::element_blank(),
    subtitle = ggplot2::element_blank(),
    x = "Leverage",
    y = "Studentized residuals"
  ) +
  # A complete theme replaces earlier `theme()` settings, so apply it first.
  ggplot2::theme_gray() +
  ggplot2::theme(
    legend.position = "right",
    axis.title = ggplot2::element_text(size = 11, colour = "black"),
    axis.text = ggplot2::element_text(colour = "gray25"),
    axis.text.y = ggplot2::element_text(size = 9)
  )

plot <- plot |> ggplot2::ggplot_build()

# The following procedure changes the plot aesthetics.
for (i in c(1:9)) {
  # "#1b6ca8" "#3aaf85"
  plot$data[[i]]$colour <- dplyr::case_when(
    plot$data[[i]]$colour == "blue" ~ ifelse(i == 4, "red", "blue"),
    plot$data[[i]]$colour == "#1b6ca8" ~ "black",
    plot$data[[i]]$colour == "darkgray" ~ "black",
    TRUE ~ plot$data[[i]]$colour
  )
}

plot |> ggplot2::ggplot_gtable() |> ggplotify::as.ggplot()
```

Relation between studentized residuals and their leverage points. The blue line represents the [Cook's distance](https://en.wikipedia.org/wiki/Cook%27s_distance). Any points outside the contour lines are influential observations.
:::

```{r}
#| eval: false
#| include: false

fit_engine |> olsrr::ols_plot_cooksd_chart()
```

```{r}
#| eval: false
#| include: false

fit_engine |> 
  car::influenceIndexPlot(
    vars = "Cook",
    id = FALSE,
    main = NULL
  )
```

#### Influence on prediction (DFFITS)

[DFFITS](https://en.wikipedia.org/wiki/DFFITS) (difference in fits) is a standardized measure of how much the prediction for a given observation would change if it were deleted from the model. Each observation’s DFFITS is standardized by the standard deviation of fit at that point [@struck2024].

The best rule of thumb is to classify as influential any standardized values that exceed $2 \sqrt{p / n}$, where $p$ is the number of independent variables plus one and $n$ is the sample size [@hair2019, p. 261].

Learn more about DFFITS in: @welsch1977 and @belsley2004.

```{r}
fit_engine |> stats::dffits() |> head()
```
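
A minimal sketch of the rule of thumb above: DFFITS can be rebuilt from the studentized residuals $t_{i}$ and hat values $h_{ii}$ as $t_{i} \sqrt{h_{ii} / (1 - h_{ii})}$ and then compared with the $2 \sqrt{p / n}$ cutoff. The model `toy_fit` (built-in `mtcars` dataset) and all `toy_*` names are hypothetical stand-ins, not objects from this analysis.

```{r}
# Stand-in model on a built-in dataset (assumption: not the document's fit).
toy_fit <- stats::lm(mpg ~ wt + hp, data = datasets::mtcars)

toy_n <- stats::nobs(toy_fit)          # Sample size
toy_p <- length(stats::coef(toy_fit))  # Number of predictors + 1

toy_dffits <- stats::dffits(toy_fit)

# DFFITS rebuilt from studentized residuals and leverage.
toy_h <- stats::hatvalues(toy_fit)
toy_dffits_manual <- stats::rstudent(toy_fit) * sqrt(toy_h / (1 - toy_h))
all.equal(toy_dffits_manual, toy_dffits)

# Observations whose absolute DFFITS exceed the cutoff.
toy_cutoff <- 2 * sqrt(toy_p / toy_n)
toy_dffits[abs(toy_dffits) > toy_cutoff]
```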

::: {#fig-model-diag-influence-8}
```{r}
#| code-fold: true

plot <- fit_engine |> 
  olsrr::ols_plot_dffits(print_plot = FALSE)

plot$plot + ggplot2::labs(title = ggplot2::element_blank())
```

Standardized DFFITS (difference in fits) for each observation.
:::

#### Influence on parameter estimates (DFBETAS)

[DFBETAS](https://en.wikipedia.org/wiki/Influential_observation#:~:text=measures%20of%20influence) measure the change in a regression coefficient when an observation is omitted from the regression analysis. **The value of the DFBETA is in terms of the coefficient itself** [@hair2019, p. 261], whereas DFBETAS is its standardized form. A cutoff for what is considered a large DFBETAS value is $2 / \sqrt{n}$, where $n$ is the number of observations [@struck2024].

Learn more about DFBETAS in: @welsch1977 and @belsley2004.

```{r}
fit_engine |> stats::dfbeta() |> head()
```
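
Since `stats::dfbeta()` returns the raw (unstandardized) changes, the $2 / \sqrt{n}$ cutoff is better applied to the output of `stats::dfbetas()`. The sketch below shows how the two relate: each raw change is divided by the leave-one-out residual standard error times the square root of the corresponding diagonal element of $(X^{\top} X)^{-1}$. The model `toy_fit` (built-in `mtcars` dataset) and all `toy_*` names are illustrative assumptions, not objects from this analysis.

```{r}
# Stand-in model on a built-in dataset (assumption: not the document's fit).
toy_fit <- stats::lm(mpg ~ wt + hp, data = datasets::mtcars)
toy_n <- stats::nobs(toy_fit)

toy_dfbeta <- stats::dfbeta(toy_fit)   # Raw change in each coefficient
toy_dfbetas <- stats::dfbetas(toy_fit) # Standardized change

# Scaling factor: leave-one-out sigma times sqrt(diag((X'X)^-1)).
toy_x <- stats::model.matrix(toy_fit)
toy_scale <- outer(
  stats::influence(toy_fit)$sigma,
  sqrt(diag(solve(crossprod(toy_x))))
)
all.equal(toy_dfbeta / toy_scale, toy_dfbetas, check.attributes = FALSE)

# Observation/coefficient pairs exceeding the cutoff.
toy_cutoff <- 2 / sqrt(toy_n)
which(abs(toy_dfbetas) > toy_cutoff, arr.ind = TRUE)
```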

```{r}
plots <- fit_engine |> olsrr::ols_plot_dfbetas(print_plot = FALSE)
```

::: {#fig-model-diag-influence-9}
```{r}
#| code-fold: true

plots$plots[[1]] + 
  ggplot2::labs(title = "Intercept coefficient")
```

Standardized DFBETAS values for each observation concerning the **intercept** coefficient.
:::

::: {#fig-model-diag-influence-10}
```{r}
#| code-fold: true

plots$plots[[2]] + 
  ggplot2::labs(title = "bill_length_mm coefficient")
```

Standardized DFBETAS values for each observation concerning the **bill_length_mm** coefficient.
:::

::: {#fig-model-diag-influence-11}
```{r}
#| code-fold: true

plots$plots[[3]] + 
  ggplot2::labs(title = "bill_depth_mm coefficient")
```

Standardized DFBETAS values for each observation concerning the **bill_depth_mm** coefficient.
:::

#### Hadi's measure

Hadi’s measure of influence is based on the idea that influential observations can occur in either the response variable, the predictors, or both.

Learn more about Hadi's measure in: @chatterjee2012.
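
For orientation, the measure is usually written as the sum of a leverage term and a residual term. The form below is a sketch from memory and should be checked against @chatterjee2012 before being relied on; here $h_{ii}$ is the hat value, $e_{i}$ the raw residual, $\text{SSE}$ the residual sum of squares, and $p$ the number of independent variables + 1.

$$
H_{i} = \frac{h_{ii}}{1 - h_{ii}} + \frac{p}{1 - h_{ii}} \cdot \frac{d_{i}^{2}}{1 - d_{i}^{2}}, \qquad d_{i} = \frac{e_{i}}{\sqrt{\text{SSE}}}
$$

The first term grows with leverage (unusual predictor values) and the second with the size of the normalized residual $d_{i}$ (unusual response values), so an observation can score high through either one or both.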

::: {#fig-model-diag-influence-12}
```{r}
#| code-fold: true

plot <- 
  fit_engine |> 
  olsrr::ols_plot_hadi(print_plot = FALSE)

plot +
  ggplot2::labs(
    title = ggplot2::element_blank(),
    y = "Hadi's measure"
  )
```

Hadi's influence measure for each observation.
:::

::: {#fig-model-diag-influence-13}
```{r}
#| code-fold: true

plot <- 
  fit_engine |> 
  olsrr::ols_plot_resid_pot(print_plot = FALSE)

plot + ggplot2::labs(title = ggplot2::element_blank())
```

Potential-residual plot classifying unusual observations as high-leverage points, outliers, or a combination of both.
:::

## Testing the hypothesis

Let's now come back to our initial hypothesis:

__Statement__
: The predictors `bill_length_mm` and `bill_depth_mm` effectively predict `flipper_length_mm`.

$$
\begin{cases}
\text{H}_{0}: \text{R}^{2}_{\text{adj}} \leq 0.5 \\
\text{H}_{a}: \text{R}^{2}_{\text{adj}} > 0.5
\end{cases}
$$

In addition to an adjusted $\text{R}^{2}$ greater than 0.5, predictors should demonstrate statistically significant associations, and the model assumptions should be satisfied.
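
As a reminder of what is being compared with 0.5, the adjusted $\text{R}^{2}$ penalizes $\text{R}^{2}$ for the number of predictors: $\text{R}^{2}_{\text{adj}} = 1 - (1 - \text{R}^{2}) (n - 1) / (n - k - 1)$. The chunk below is a minimal sketch of that calculation on a hypothetical stand-in model (`toy_fit`, fitted to the built-in `mtcars` dataset); the `toy_*` names are assumptions and not objects from this analysis.

```{r}
# Stand-in model on a built-in dataset (assumption: not the document's fit).
toy_fit <- stats::lm(mpg ~ wt + hp, data = datasets::mtcars)
toy_summary <- summary(toy_fit)

toy_n <- stats::nobs(toy_fit)             # Sample size
toy_k <- length(stats::coef(toy_fit)) - 1 # Number of predictors
toy_r2 <- toy_summary$r.squared

# Adjusted R-squared computed by hand.
toy_adj_r2 <- 1 - (1 - toy_r2) * (toy_n - 1) / (toy_n - toy_k - 1)
all.equal(toy_adj_r2, toy_summary$adj.r.squared)

# The criterion used in this section, applied to the stand-in model.
toy_adj_r2 > 0.5
```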

In the sections above, we confirmed that the model satisfies the assumptions of linearity, normality, homoscedasticity, and independence. (**Model validity requirement**: True)

@tbl-hypothesis-test-1 shows that the predictors `bill_length_mm` and `bill_depth_mm` are statistically significant in predicting `flipper_length_mm`. (**Predictor significance requirement**: True)

::: {#tbl-hypothesis-test-1}
```{r}
#| code-fold: true

fit |> 
  broom::tidy() |>
  janitor::adorn_rounding(5)
```

Output from the model fitting process showing the estimated coefficients, standard errors, test statistics, and p-values for the terms in the linear regression model.
:::

We can see in @tbl-hypothesis-test-2 that the adjusted $\text{R}^{2}$ is lower than 0.5, which means that the model does not explain more than 50% of the variance in the dependent variable. (**Adjusted R-squared requirement**: False)

::: {#tbl-hypothesis-test-2}
```{r}
#| code-fold: true

fit |> 
  broom::glance() |> 
  tidyr::pivot_longer(cols = dplyr::everything()) |>
  janitor::adorn_rounding(10)
```

Summary of model fit statistics showing key metrics including R-squared, adjusted R-squared, sigma, statistic, p-value, degrees of freedom, log-likelihood, AIC, BIC, and deviance.
:::

::: {#tbl-hypothesis-test-3}
```{r}
#| code-fold: true

psychometric::CI.Rsq(
  rsq = broom::glance(fit)$adj.r.squared,
  n = nrow(data),
  k = length(fit_engine$coefficients) - 1,
  level = 0.95
)
```

Confidence interval for the adjusted R-squared value. LCL corresponds to the lower confidence limit, and UCL to the upper confidence limit.
:::

Therefore, **we fail to reject the null hypothesis**.

### Effect size

If we would like to interpret the adjusted $\text{R}^{2}$ value in terms of effect size, we can use the following code:

<!-- @cohen1988 -->

```{r}
fit |> 
  broom::glance() |> 
  magrittr::extract2("adj.r.squared") |>
  effectsize::interpret_r2("cohen1988")
```

<!-- @falk1992 -->

```{r}
fit |> 
  broom::glance() |> 
  magrittr::extract2("adj.r.squared") |>
  effectsize::interpret_r2("falk1992")
```

## Conclusion

Following our criteria, we can now answer our question:

**Can bill length and bill depth alone effectively predict flipper length in Adélie penguins?**

The answer is **No**. The model does not explain more than 50% of the variance in the dependent variable, which means that the predictors `bill_length_mm` and `bill_depth_mm` are not effective in predicting `flipper_length_mm`.

## Final remarks

I hope the explanations, visualizations, and code have helped clarify general linear models. If you have any questions, feel free to reach out. You can find me on [GitHub](https://github.com/danielvartan).

For further learning on general linear models, I recommend the following resources:

- @degroot2012
- @casella2002
- @allen1997
- @bussab1988 (pt-BR)
- @dalpiaz
- @dudek2020
- @fox2016
- @hair2019
- @johnson2013
- @kuhn2022
- @struck2024

Additionally, I highly recommend Josh Starmer’s [StatQuest](https://www.youtube.com/@statquest) and Christian Pascual's [Very Normal](https://www.youtube.com/@very-normal) YouTube channels. The following videos are especially helpful:

- Starmer, J. (Nov 18, 2022). _Multiple Regression, Clearly Explained!!!_ [YouTube video]. [https://youtu.be/EkAQAi3a4js?si=MfKPGlFcYqfFAM7j](https://youtu.be/EkAQAi3a4js?si=MfKPGlFcYqfFAM7j)
- Starmer, J. (Nov 18, 2022). _Multiple Regression in R, Step by Step!!!_ [YouTube video]. [https://youtu.be/mno47Jn4gaU?si=oba5odJm8fjeizs0](https://youtu.be/mno47Jn4gaU?si=oba5odJm8fjeizs0)

## References {.unnumbered}

::: {#refs}
:::