selection.qmd

---
title: "Model Selection"
subtitle: "Using latent class mixed models with natural cubic splines."
author:
  - name: "Nathan Constantine-Cooke"
    url: https://scholar.google.com/citations?user=2emHWR0AAAAJ&hl=en&oi=ao
    corresponding: true
    affiliations:
      - ref: HGU
      - ref: CGEM
  - name: "Karla Monterrubio-Gómez"
    url: https://scholar.google.com/citations?user=YmyxSXAAAAAJ&hl=en
    affiliations:
      - ref: HGU
  - name: "Riccardo E. Marioni"
    url: https://scholar.google.com/citations?hl=en&user=gA3Ik3MAAAAJ
    affiliations: 
      - ref: CGEM
  - name: "Catalina A. Vallejos"
    url: https://scholar.google.com/citations?user=lkdrwm0AAAAJ&hl=en&oi=ao
    affiliations:
      - ref: HGU
      - ref: Turing
#comments:
#  giscus: 
#    repo: quarto-dev/quarto-docs
---
      
## Introduction

```{r Setup}
#| message: false
set.seed(123)

# Fitting the LCMMs takes around three hours (when running single threaded)
# if cache.models is true then the saved model objects will be used instead of
# refitting the models
cache.models <- TRUE

##########################
#--     Packages       --#
##########################

library(tidyverse)
## Modelling ##
library(lcmm)
library(splines)
## Presentation ##
library(htmltools)
library(patchwork)
library(ggdist)
library(grid)
library(ggalluvial)
library(qqplotr)

if (!require(DT)) {
  install.packages("DT")
}

##########################
#--     Data read      --#
##########################

FCcumulative <- readRDS(paste0("/Volumes/igmm/cvallejo-predicct/",
                               "cdi/processed/",
                               "FCcumulativeLongInc.RDS")
                        )

###########################################
#-- Create directories and readme files --#
###########################################

if (!dir.exists("plots")) {
  dir.create("plots")
}

fileConn <- file("plots/README.md")
writeLines(c("# README",
             "",
             paste("This directory contains plots created by the analysis", 
                   "but are not provided as figures (supplementary or", 
                   "otherwise) in the paper")),
           fileConn)
close(fileConn)


if (!dir.exists("cache")) dir.create("cache")
fileConn <- file("cache/README.md")
writeLines(c("# README",
             "",
             paste("This directory stores cached versions of R objects which",
                   "take a long time to create (such as LCMM fit objects).")),
           fileConn)
close(fileConn)

if (!dir.exists("paper")) dir.create("paper")
fileConn <- file("paper/README.md")
writeLines(c("# README",
             "",
             paste("This directory contains all figures used in the paper.",
                   "The sup subdirectory contains supplementary figures")),
           fileConn)
close(fileConn)

if (!dir.exists("paper/sup")) dir.create("paper/sup")

if (!dir.exists("plots/residuals")) {
  dir.create("plots/residuals")
}

########################
#-- Custom Functions --#
########################

# Build DT::datatable objects from matrix of fit statistics
DTbuild <- function(hlme.metrics, caption) {
  hlme.metrics <- cbind(hlme.metrics, group = seq(1, nrow(hlme.metrics)))
  hlme.metrics <- hlme.metrics[, c(4, 1, 2, 3)]
  DT::datatable(round(as.data.frame(hlme.metrics), 2),
                options = list(dom = 't'),
                caption = tags$caption(
                  style = 'text-align: center;',
                  h3(caption)),
                style ="bootstrap4",
                rownames = FALSE,
                colnames = c("Clusters",
                             "Maximum log-likelihood",
                             "AIC",
                             "BIC"),
                escape = FALSE)
}

#' Spaghetti plots of each class
#' @param models list containing HLME objects
#' @param G How many classes does the model assume?
#' @param log Logical. Should plots be on log scale
#' @param indi Logical. Should separate plots for each class be generated?
#' @param multi Logical. Should all plots be plotted alongside each other? 
#' @param tmax Maximum observation period
#' @param column Logical. Should all sub-plots be in a single column? Defaults
#'   to false (two columns)
#' @param pprob.cutoff Posterior probability cut-off for subjects to be included
#'   as trajectories
#' @param mapping Numeric vector which gives reordering of plots in a
#'   multiplot. Need to take into account plots are generated by column - not
#'   row
#' @param sizes Output latent class sizes
#' @param save Logical. Should sub figure labels be generated?
spaghetti_plot <- function(FCcumulative,
                           models,
                           G,
                           log = TRUE,
                           indi = FALSE, 
                           multi = TRUE,
                           tmax = 5,
                           column =  FALSE, 
                           pprob.cutoff = NA,
                           sizes = FALSE,
                           mapping = NULL,
                           save = FALSE){
  
  if(!is.na(pprob.cutoff)) {
    pprob.cutoffs <- c()
    
    for (subject in unique(models[[G]]$pprob$id)){
      temp <- subset(models[[G]]$pprob, id == subject)
      pprob <- temp[, 2 + temp$class]
      if (pprob > pprob.cutoff) {
        pprob.cutoffs <- c(pprob.cutoffs, subject)
      }
    }
    FCcumulative <- subset(FCcumulative, id %in% pprob.cutoffs)
  }
  
  if (indi){
    spaghetti_plot_sub(FCcumulative = FCcumulative, 
                       models = models,
                       G = G,
                       log = log,
                       multi = FALSE,
                       tmax = tmax,
                       column = column,
                       pprob = pprob.cutoff,
                       sizes = sizes,
                       mapping = mapping)
    }
  if (multi){
      spaghetti_plot_sub(FCcumulative = FCcumulative, 
                         models = models,
                         G = G,
                         log = log,
                         multi = TRUE,
                         tmax = tmax,
                         column = column,
                         pprob = pprob.cutoff,
                         sizes = sizes,
                         mapping = mapping,
                         save = save)
  }
}

spaghetti_plot_sub <- function(FCcumulative,
                               models,
                               G,
                               multi = FALSE,
                               log,
                               unit,
                               tmax = 5,
                               column = FALSE,
                               pprob = NA,
                               sizes = FALSE,
                               mapping = NULL,
                               save = FALSE){
  
  labels <- c("A", "C", "B", "D")
  
  time <- seq(0, tmax, by = 0.01)
  
  if (column) {
    # use single column layout for sub-plots
    layout <- matrix(seq(1, G),
                   ncol = 1,
                   nrow = G)
  } else {
    # use two column layout for sub-plots
      layout <- matrix(seq(1, 2 * ceiling(G / 2)),
                   ncol = 2,
                   nrow = ceiling(G / 2))
  }
  # Set up the page
  pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
  
  data_pred <- data.frame(time = time)
  pred <- predictY(models[[G]],
                   data_pred,
                   var.time = "time",
                   draws = TRUE)
  lcmm_uit <- as.data.frame(pred$pred)
  lcmm_uit$time <- time
  
  if(is.null(mapping)) {
    mapping <- 1:G
  }
  
  # Initialise the list of plots locally rather than relying on a global `p`
  p <- list()
  
  
  for (g in mapping) {
    matchidx <- as.data.frame(which(layout == g, arr.ind = TRUE))
    id.group <- models[[G]]$pprob[models[[G]]$pprob[, 2] == mapping[g], 1]
    if (sizes) {
      message("There are ",
              length(id.group),
              " subjects in cluster ",
              g,
              ".")
    }
    if(!log){
      p[[g]] <- ggplot(data = subset(FCcumulative, id %in% id.group),
                     aes(x = time, y = exp(value))) +
      geom_line(aes(group = id), alpha = 0.1) +
      theme_minimal() + 
      geom_line(data = lcmm_uit,
                aes(x = time,
                    y = exp(lcmm_uit[, paste0("Ypred_class", mapping[g])])),
                size = 1.5,
                col = "red") +
      geom_line(data = lcmm_uit,
                aes(x = time,
                    y = exp(lcmm_uit[, paste0("lower.Ypred_class",
                                              mapping[g])])),
                col = "red",
                lty = 2) +
      geom_line(data = lcmm_uit,
                aes(x = time,
                    y = exp(lcmm_uit[, paste0("upper.Ypred_class",
                                              mapping[g])])),
                col = "red",
                lty = 2) +
      xlab("Time (years)") +
      ylab("FCAL (μg/g)") +
      ylim(0, 2500)
    } else{
       p[[g]] <- ggplot(data = subset(FCcumulative, id %in% id.group),
                     aes(x = time, y = value)) +
      geom_line(aes(group = id), alpha = 0.1) +
      theme_minimal() + 
      geom_line(data = lcmm_uit,
                aes(x = time, y = lcmm_uit[, paste0("Ypred_class",
                                                    mapping[g])]),
                size = 1.5,
                col = "red") +
      geom_line(data = lcmm_uit,
                aes(x = time, y = lcmm_uit[, paste0("lower.Ypred_class",
                                                    mapping[g])]),
                col = "red",
                lty = 2) +
      geom_line(data = lcmm_uit,
                aes(x = time, y = lcmm_uit[, paste0("upper.Ypred_class",
                                                    mapping[g])]),
                col = "red",
                lty = 2) +
      geom_hline(yintercept = log(250),
                 color = "#007add",
                 lty = 3,
                 size = 1.5) +
      xlab("Time (years)") +
      ylab("Log (FCAL (μg/g))") +
      ylim(2, log(2500))
    }
    if (multi) {
      if (save) {
        # Add subfigure labels
        print(p[[g]] +
                ggtitle(labels[g]) +
                theme_classic() +
                theme(axis.line = element_line(colour = "gray"),
                      plot.title = element_text(face = "bold",
                                                size = 20)),
              vp = viewport(layout.pos.row = matchidx$row,
                            layout.pos.col = matchidx$col)
              )
      } else {
        print(p[[g]],
              vp = viewport(layout.pos.row = matchidx$row,
                            layout.pos.col = matchidx$col)
              )
      }
    } else {
      print(p[[g]])
    }
  }
}
```

To identify clusters within the Crohn's disease (CD) population with similar
faecal calprotectin (FCAL) profiles, we use latent class mixed models (LCMMs)
with natural cubic spline formulations for the fixed and random effects
components. LCMMs extend linear mixed effects models by adding class-specific
fixed effects. Cluster membership in an LCMM is modelled via a multinomial
logistic model.
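
As a minimal sketch of the membership component (not the `{lcmm}` implementation), class probabilities under an intercept-only multinomial logistic model can be computed as below. The function name `class_probabilities`, the intercept values, and the choice of the last class as the reference are illustrative assumptions.

```r
# Intercept-only multinomial logistic model for class membership.
# `intercepts` holds one value per non-reference class (hypothetical values).
class_probabilities <- function(intercepts) {
  # The reference class has its linear predictor fixed at 0 for identifiability
  linear_predictors <- c(intercepts, 0)
  exp(linear_predictors) / sum(exp(linear_predictors))
}

# Four classes: three free intercepts plus the reference class
probs <- class_probabilities(c(0.5, -0.2, 1.0))
print(round(probs, 3))
```

The probabilities sum to one by construction, and a larger intercept corresponds to a larger expected cluster.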

Previously, we have investigated using polynomial regression models with an
I-splines link function, and models which use Gaussian radial basis functions
(GRBFs) to model the fixed and random components of an LCMM.
Unfortunately, the polynomial regression and I-splines link approaches
demonstrated inflexibility and, in the case of polynomial regression, behaved
erratically near the ends of the time period.

The GRBF approach was very sensitive to $l$, a length scale parameter, and the
iterative Marquardt algorithm implemented in the `{lcmm}` package did not
converge in many cases, possibly due to the number of parameters to be
estimated and/or the Runge phenomenon [@Fornberg2007].

This has led us to consider an approach using natural cubic splines, which has a
few notable advantages [@Elhakeem2022]:

1. Fewer parameters need to be estimated than for either a Gaussian radial basis
   function regression model or a polynomial regression model with the same
   flexibility. This reduces the time complexity when fitting the model and may
   also make future extensions more practically feasible.
2. Natural cubic splines enforce linearity between $t_0$ and the first knot and
   between the last knot and $t_\text{max}$, which ensures the model does not
   behave erratically in these sometimes problematic areas.
3. Natural cubic splines are not highly sensitive to a continuous parameter and
   instead require only $K$, the number of knots, to be tuned, while being
   robust to where the knots themselves are placed.

## Formal definitions

For formal definitions of the models and statistics we have used in the work,
please see the supplementary material for our paper. 

## The Crohn's Disease Inception Cohort

Background on the data used to fit the models, and an explanation of the data
processing steps implemented, can be found on a
[dedicated page](data-cleaning.qmd). Due to the distribution of the FCAL
values (@fig-dist), FCAL values have been log-transformed prior to model fitting
(@fig-dist-log).

```{R Log transform}
#| label: fig-dist
#| fig-cap: "Distribution of FCAL values when in measurement units."
FCcumulative %>%
  ggplot(aes(x = value, y = NULL)) +
  stat_slab(size = 0.8, alpha = 0.5, fill = "#235789") +
  geom_dots(binwidth = 10, size = 1, side = "bottom", color = "#235789") +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("FCAL (µg/g)") +
  ylab("") +
  ggtitle("Distribution of FCAL Measurements",
          "Crohn's disease inception cohort")
```

```{R}
#| label: fig-dist-log
#| fig-cap: "Distribution of FCAL values after a log transformation has been applied."
FCcumulative$value <- log(FCcumulative$value)

FCcumulative %>%
  ggplot(aes(x = value, y = NULL)) +
  stat_slab(size = 0.8, alpha = 0.5, fill = "#235789") +
  geom_dots(binwidth = 0.02, size = 1, side = "bottom", color = "#235789") +
  theme_minimal() +
  theme(axis.text.y = element_blank()) +
  xlab("Log (FCAL (µg/g))") +
  ylab("") +
  ggtitle("Distribution of Log-Transformed FCAL Measurements",
          "Crohn's disease inception cohort")
```


## Model fitting

LCMMs with 2 to 6 assumed clusters are considered. As recommended by
@Proust-Lima2017, a model is initially fitted with one cluster (i.e. a regular
linear mixed effects model). This model is used to sample initial values in a
grid search, which attempts to find the optimal model for each assumed number of
clusters based upon maximum likelihood. The trajectory of this linear mixed
effects model is given by @fig-lme.

For the fixed and random components of each model, we consider natural
cubic splines of time with three knots (i.e. five fixed points including the
boundaries of the splines). The knots for the natural cubic splines are placed
at the 1<sup>st</sup> quartile, median, and 3<sup>rd</sup>
quartile of the FCAL measurement times for the study cohort. These correspond to
[`r round(attr(ns(FCcumulative$time, df = 4), "knots"), 2)`] years from
diagnosis.
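
The link between `df = 4` and this knot placement can be checked on simulated data. The sketch below uses made-up measurement times (`toy_time` is hypothetical, not the cohort data) and relies on the default behaviour of `splines::ns()`, which places `df - 1` interior knots at equally spaced quantiles when no intercept is requested.

```r
library(splines)

set.seed(1)
# Hypothetical measurement times over a five-year follow-up
toy_time <- runif(200, min = 0, max = 5)

# Same spline specification as used in the models: df = 4, no intercept
basis <- ns(toy_time, df = 4)

# Interior knots chosen by ns() and the quartiles of the observed times
interior_knots <- unname(attr(basis, "knots"))
quartiles <- unname(quantile(toy_time, probs = c(0.25, 0.5, 0.75)))

# Boundary knots default to the range of the observed times
boundary_knots <- attr(basis, "Boundary.knots")

print(ncol(basis))                 # number of spline coefficients per effect
print(round(interior_knots, 2))
print(round(boundary_knots, 2))
```

Each spline term therefore contributes four coefficients, in addition to the intercept, to the fixed, random, and mixture parts of the model.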


```{R Model fitting}
#| fig-width: 12
#| fig-height: 6.75
#| label: fig-lme
#| fig-cap: "Linear mixed effects (LME) model fitted to data and used to generate initial values for the grid search method used for LCMMs with $G > 1$"
ngroups <- c(2, 3, 4, 5, 6)
rep <- 50
maxiter <- 10
if (!file.exists("cache/cubicbf.fits.RDS") | !cache.models){
  m1 <-  hlme(fixed =  value ~ ns(time, df = 4),
              random = ~  ns(time, df = 4),
              subject = "id",
              data = FCcumulative,
              verbose = FALSE,
              var.time = "time",
              maxiter = 8000)
  print(summary(m1))
  if (!m1$conv) stop("LME did not converge \n")
  
  hlme.metrics <- matrix(nrow = 0, ncol = 3)
  colnames(hlme.metrics) <- c("maximum log-likelihood", "AIC", "BIC")
  temp <- matrix(c(m1$loglik, m1$AIC, m1$BIC),  nrow = 1)
  rownames(temp) <- "1"
  hlme.metrics <- rbind(hlme.metrics, temp)
  
  
  cubicbf.fits <- list()
  cubicbf.fits[["group1"]] <- m1

  for (ngroup in ngroups) {
    ng <- ngroup
    cl <- parallel::makeCluster(parallel::detectCores())
    parallel::clusterExport(cl, "ng")
    hlme.fit <- gridsearch(
      rep = rep,
      maxiter = maxiter,
      minit = m1,
      cl = cl,
      hlme(fixed =  value ~ ns(time, df = 4),
           mixture = ~  ns(time, df = 4),
           random = ~  ns(time, df = 4),
           subject = "id",
           ng = ng,
           data = FCcumulative,
           verbose = FALSE)
    )
    parallel::stopCluster(cl)
    
    cubicbf.fits[[paste0("group", ngroup)]] <- hlme.fit
  
    if (hlme.fit$conv) {
      cat("Convergence achieved for ", ng, "subgroups ✅ \n")
    } else {
      cat("Convergence NOT achieved for ", ng, " subgroups ⚠️ \n")
    }
    
    temp <- matrix(c(hlme.fit$loglik, hlme.fit$AIC, hlme.fit$BIC),  nrow = 1)
    rownames(temp) <- ngroup
    hlme.metrics <- rbind(hlme.metrics, temp)
  }
  saveRDS(cubicbf.fits, "cache/cubicbf.fits.RDS")
  saveRDS(hlme.metrics, "cache/cubicbf.RDS")
} else{
  cubicbf.fits <- readRDS("cache/cubicbf.fits.RDS")
  hlme.metrics <- readRDS("cache/cubicbf.RDS" )
}

m1 <- cubicbf.fits[[1]]
x <- predictY(m1,
              newdata = data.frame(time = seq(0, 5, by = 0.01)),
              var.time = "time",
              draws = TRUE)
par(mfrow = c(1, 2))

plot(FCcumulative$time,
     FCcumulative$value,
     xlab = "Time",
     ylab = "Log(FCAL)",
     main = "LME with Cubic Natural Splines (Log Scale)",
     col = rgb(0,0,0, 0.3))
lines(x$times[,1], x$pred[,1], col = "red")
lines(x$times[,1], x$pred[,2], col = "red", lty = 2) # Conf intervals
lines(x$times[,1], x$pred[,3], col = "red", lty = 2) 

# Plot knots as vertical lines
abline(v = as.numeric(attr(ns(FCcumulative$time, df = 4),
                           "knots")),
       col = "blue",
       lty = 3)
 
plot(FCcumulative$time,
     exp(FCcumulative$value),
     xlab = "Time",
     ylab = "FCAL",
     main = "LME with Cubic Natural Splines (Measurement Scale)",
     col = rgb(0,0,0, 0.3))
lines(x$times[, 1], exp(x$pred[, 1]), col = "red")
lines(x$times[, 1], exp(x$pred[, 2]), col = "red", lty= 2)
lines(x$times[, 1], exp(x$pred[, 3]), col = "red", lty= 2)
abline(v = as.numeric(attr(ns(FCcumulative$time, df = 4), "knots")),
       col = "blue", lty = 3)
par(mfrow = c(1,1))
```

```{R Save LME, include = FALSE}
png(file = "LME.experiment.png", width = 16, height = 9, units = "in", res = 300)
par(mfrow = c(1, 2))

plot(FCcumulative$time,
     FCcumulative$value,
     xlab = "Time",
     ylab = "Log(FCAL)",
     main = "LME with Cubic Natural Splines (Log Scale)",
     col = rgb(0,0,0, 0.3))
lines(x$times[, 1], x$pred[, 1], col = "red")
lines(x$times[, 1], x$pred[, 2], col = "red", lty = 2) # Conf intervals
lines(x$times[, 1], x$pred[, 3], col = "red", lty = 2) 

# Plot knots as vertical lines
abline(v = as.numeric(attr(ns(FCcumulative$time, df = 4), "knots")),
       col = "blue",
       lty = 3)
 
plot(FCcumulative$time,
     exp(FCcumulative$value),
     xlab = "Time",
     ylab = "FCAL",
     main = "LME with Cubic Natural Splines (Measurement Scale)",
     col = rgb(0,0,0, 0.3))
lines(x$times[, 1], exp(x$pred[, 1]), col = "red")
lines(x$times[, 1], exp(x$pred[, 2]), col = "red", lty= 2)
lines(x$times[, 1], exp(x$pred[, 3]), col = "red", lty= 2)
abline(v = as.numeric(attr(ns(FCcumulative$time, df = 4), "knots")),
       col = "blue",
       lty = 3)
dev.off()
par(mfrow = c(1, 1))
```

## Model selection

### Model fit 

We consider two metrics when assessing model fit: AIC and BIC, both of which
penalise model complexity.
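
For reference, both criteria can be computed directly from a maximised log-likelihood. The helper `fit_criteria` and the toy numbers below are illustrative assumptions. Which sample size $n$ enters the BIC penalty (subjects or observations) depends on the software, so it is left as an explicit argument here.

```r
# AIC and BIC from a maximised log-likelihood (lower values are preferred).
# `n_par` is the number of estimated parameters; `n` is the sample size used
# in the BIC penalty (an assumption to check against the software in use).
fit_criteria <- function(loglik, n_par, n) {
  c(AIC = 2 * n_par - 2 * loglik,
    BIC = n_par * log(n) - 2 * loglik)
}

# Toy values, not taken from the fitted models
toy <- fit_criteria(loglik = -100, n_par = 5, n = 50)
print(round(toy, 2))
```

Because $\log(n) > 2$ once $n \geq 8$, BIC penalises each additional parameter more heavily than AIC, which is why BIC tends to favour models with fewer clusters.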

```{R Fit statistics}
#| fig-width: 10
#| fig-height: 5.5

groups.1 <- cubicbf.fits[[1]]
groups.2 <- cubicbf.fits[[2]]
groups.3 <- cubicbf.fits[[3]]
groups.4 <- cubicbf.fits[[4]]
groups.5 <- cubicbf.fits[[5]]
groups.6 <- cubicbf.fits[[6]]

DTbuild(hlme.metrics,
        caption = "Fit Metrics for Natural Cubic Splines Model")
```


Considering all of the models above, AIC is optimal for the $G = 5$
model and BIC is optimal for the $G = 2$ model. However, an alluvial
plot (@fig-alluvial) suggests neither $G = 2$ nor $G = 5$ is a suitable model.
The $G = 2$ model clearly has an additional well-defined cluster not properly
represented by just two clusters, whilst the $G = 5$ model results in an
incredibly small new cluster. As such, $G = 4$ may be a more suitable
alternative.

### Cluster discrimination

#### Alluvial plots

```{R Alluvial plots}
#| warning: false
#| label: fig-alluvial
#| fig-cap: "Alluvial plot demonstrating how cluster membership changes as the assumed number of clusters increases. The height of each band indicates the size of each cluster."
re_label <- function(old.G, new.G, alluvial.df){
  
  new.order <- rep(new.G, new.G)
  old.clusters <- subset(alluvial.df, G == old.G)
  new.clusters <- subset(alluvial.df, G == new.G)

  for (g in old.G:1) {
    ids <- subset(old.clusters, class == g)$id
    for (new.g in 1:new.G) {
      new.clusters.g <- subset(new.clusters, new.g == class)
      if (nrow(subset(new.clusters.g, id %in% ids)) > 0.5 * length(ids)) {
        new.order[new.g] <- g 
      }
    }
  }
  
  alluvial.df[alluvial.df[, "G"] == new.G, "class"] <-
    plyr::mapvalues(alluvial.df[alluvial.df[, "G"] == new.G, "class"],
                    from = seq(1, new.G),
                    new.order)
  return(alluvial.df)
}

# convert to alluvial format
alluvial.df <- cbind(groups.2$pprob[, 1:2], G = 2)
alluvial.df <- rbind(alluvial.df, cbind(groups.3$pprob[, 1:2], G = 3))
alluvial.df <- rbind(alluvial.df, cbind(groups.4$pprob[, 1:2], G = 4))
alluvial.df <- rbind(alluvial.df, cbind(groups.5$pprob[, 1:2], G = 5))
alluvial.df <- rbind(alluvial.df, cbind(groups.6$pprob[, 1:2], G = 6))
alluvial.df$id <- as.character(alluvial.df$id)
alluvial.df$class <- as.factor(alluvial.df$class) 

# eliminate label switching
alluvial.df <- re_label(2, 3, alluvial.df)
alluvial.df <- re_label(3, 4, alluvial.df)
alluvial.df <- re_label(4, 5, alluvial.df)
alluvial.df <- re_label(5, 6, alluvial.df)

p <- ggplot(alluvial.df,
            aes(x = G,
            stratum = class,
            alluvium = id,
            fill = class,
            label = class)) + 
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow()  +
  geom_stratum(alpha = 0.5) +
  geom_text(stat = "stratum", size = 3) +
  theme_classic() +
  theme(axis.line = element_line(colour = "gray")) +
  theme(legend.position = "none") +
  ggtitle("Alluvial plot of cluster membership across G",
          "Crohn's disease inception cohort") +
  scale_fill_manual(values = c("#e3281f",
                               "#3aa534",
                               "#ff5885",
                               "#fbc926",
                               "#511e9d",
                               "black")) +
  xlab("Assumed number of clusters") + 
  ylab("Frequency")
print(p)

p <- p + ggtitle("", "")
ggsave("paper/alluvial.png", p, width = 8, height = 4.5, units = "in")
ggsave("paper/alluvial.pdf", p, width = 8, height = 4.5, units = "in")
```
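
The relabelling rule used to remove label switching can be illustrated on a toy example. The sketch below is a simplified stand-alone version operating on two label vectors rather than the alluvial data frame. The function name `relabel_by_majority` and the toy assignments are hypothetical.

```r
# A new cluster inherits the label of an old cluster if it contains more than
# half of that old cluster's members; unmatched new clusters get the highest
# label, which corresponds to the newly created cluster.
relabel_by_majority <- function(old, new) {
  new_labels <- sort(unique(new))
  mapping <- rep(length(new_labels), length(new_labels))
  for (g in sort(unique(old))) {
    members <- which(old == g)
    for (k in seq_along(new_labels)) {
      share <- mean(new[members] == new_labels[k])
      if (share > 0.5) mapping[k] <- g
    }
  }
  # Translate each subject's new label using the mapping
  mapping[match(new, new_labels)]
}

# Six subjects: labels under a two-cluster and a three-cluster model
old <- c(1, 1, 1, 2, 2, 2)
new <- c(2, 2, 2, 1, 1, 3)  # labels 1 and 2 have switched; 3 is a new cluster

relabelled <- relabel_by_majority(old, new)
print(relabelled)
```

After relabelling, subjects who stay in essentially the same cluster keep the same label across models, so the bands of the alluvial plot only cross where membership has changed.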

#### Posterior classifications

An alternative to co-clustering when assessing cluster discrimination is to
consider posterior classification probabilities. The output below shows how
these posterior probabilities change as the number of assumed clusters
increases.

::: {.panel-tabset}

##### G = 2

```{R Pprob G2}
postprob(cubicbf.fits[[2]])
```

##### G = 3

```{R Pprob G3}
postprob(cubicbf.fits[[3]])
```

##### G = 4

```{R Pprob G4}
postprob(cubicbf.fits[[4]])
```


##### G = 5

```{R Pprob G5}
postprob(cubicbf.fits[[5]])
```

##### G = 6

```{R Pprob G6}
postprob(cubicbf.fits[[6]])
```

:::

```{R Pprob distributions}
#| fig-width: 8
#| fig-height: 4.5
pprobs.2 <- c()
pprobs.3 <- c()
pprobs.4 <- c()
pprobs.5 <- c()
pprobs.6 <- c()
for (i in 1:nrow(cubicbf.fits[[1]]$pprob)){
  class.2 <- groups.2$pprob[i, 2]
  pprobs.2 <- c(pprobs.2, groups.2$pprob[i, class.2 + 2 ])
  class.3 <- groups.3$pprob[i, 2]
  pprobs.3 <- c(pprobs.3, groups.3$pprob[i, class.3 + 2 ])
  class.4 <- groups.4$pprob[i, 2]
  pprobs.4 <- c(pprobs.4, groups.4$pprob[i, class.4 + 2 ])
  class.5 <- groups.5$pprob[i, 2]
  pprobs.5 <- c(pprobs.5, groups.5$pprob[i, class.5 + 2 ])
  class.6 <- groups.6$pprob[i, 2]
  pprobs.6 <- c(pprobs.6, groups.6$pprob[i, class.6 + 2 ])
}
pprobs.2 <- tibble(prob = pprobs.2)
pprobs.3 <- tibble(prob = pprobs.3)
pprobs.4 <- tibble(prob = pprobs.4)
pprobs.5 <- tibble(prob = pprobs.5)
pprobs.6 <- tibble(prob = pprobs.6)

pprobs.2$Model <- as.factor(rep("Two clusters", nrow(pprobs.2)))
pprobs.3$Model <- as.factor(rep("Three clusters", nrow(pprobs.3)))
pprobs.4$Model <- as.factor(rep("Four clusters", nrow(pprobs.4)))
pprobs.5$Model <- as.factor(rep("Five clusters", nrow(pprobs.5)))
pprobs.6$Model <- as.factor(rep("Six clusters", nrow(pprobs.6)))
pprobs <- rbind(pprobs.2, pprobs.3, pprobs.4, pprobs.5, pprobs.6)

p <- pprobs %>%
  ggplot(aes(x = prob, y = Model)) +
  #geom_histogram(bins = 40, fill = NA, position="identity")
  stat_slab(aes(fill = Model),color = "gray",
                    size = 0.8,
                    alpha = 0.2) +
  geom_dots(aes(fill = Model, color = Model), dotsize = 1) +
  xlab("Posterior probability for cluster membership") +
  ylab("") + 
  ggtitle("Distribution of Posterior Probabilities Across Models",
          "Subject-specific posterior probabilities for assigned cluster") +
  theme_minimal() + 
  scale_color_manual(values = c("#e3281f",
                                "#3aa534",
                                "#ff5885",
                                "#fbc926",
                                "#511e9d")
                     ) +
  scale_fill_manual(values = c("#e3281f",
                               "#3aa534",
                               "#ff5885",
                               "#fbc926",
                               "#511e9d")
                    ) +
  scale_y_discrete(limits = rev)
print(p)
ggsave("plots/Distributions.png", p, width = 8.5, height = 4.5, units = "in")
ggsave("plots/Distributions.pdf", p, width = 8.5, height = 4.5, units = "in")
```
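
The loop above extracts, for each subject, the posterior probability of the cluster to which they were assigned. As a sketch of an equivalent vectorised approach, matrix indexing can pick one entry per row. The data frame `toy_pprob` is hypothetical and only mirrors the assumed layout of a `pprob` table (`id`, `class`, then one probability column per class).

```r
# Toy posterior probability table for a two-cluster model
toy_pprob <- data.frame(
  id    = c("a", "b", "c"),
  class = c(2, 1, 2),
  prob1 = c(0.1, 0.7, 0.4),
  prob2 = c(0.9, 0.3, 0.6)
)

# Keep only the probability columns as a numeric matrix
prob_matrix <- as.matrix(toy_pprob[, c("prob1", "prob2")])

# Select, for each row i, the column given by that subject's assigned class
assigned_prob <- prob_matrix[cbind(seq_len(nrow(toy_pprob)), toy_pprob$class)]
print(assigned_prob)
```

This avoids growing vectors inside a loop and generalises to any number of clusters without repeating code per model.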

### Residual plots

To ensure model assumptions are not violated, residual plots are also consulted.
The residual plots are very similar across all models considered and are
reassuring. As we later decide on the $G = 4$ model
([see the Spaghetti plots per cluster section](#spaghetti-plots-per-cluster)),
we have generated additional plots examining the normality of the residuals for
this model. 

::: {.panel-tabset}

#### G = 1

```{R Save residual G1}
#| include: false
png("plots/residuals/g1.png",
    width = 16,
    height = 9,
    units = "in",
    res = 300)
plot(cubicbf.fits[[1]], shades = TRUE)
dev.off()
```

```{R Print residual G1}
#| fig-width: 11
#| fig-height: 8
plot(cubicbf.fits[[1]], shades = TRUE)
```

#### G = 2

```{R Save residual G2}
#| include: false
png("plots/residuals/g2.png",
    width = 16,
    height = 9,
    units = "in",
    res = 300)
plot(cubicbf.fits[[2]], shades = TRUE)
dev.off()
```


```{R Print residual G2}
#| fig-width: 11
#| fig-height: 8
plot(cubicbf.fits[[2]], shades = TRUE)
```

#### G = 3

```{R Save residual G3}
#| include: false
png("plots/residuals/g3.png",
    width = 16,
    height = 9,
    units = "in",
    res = 300)
plot(cubicbf.fits[[3]], shades = TRUE)
dev.off()
```

```{R Print residual G3}
#| fig-width: 11
#| fig-height: 8
plot(cubicbf.fits[[3]], shades = TRUE)
```

#### G = 4

```{R Save residual G4}
#| include: false
png("plots/residuals/g4.png",
    width = 16,
    height = 9,
    units = "in",
    res = 300)
plot(cubicbf.fits[[4]], shades = TRUE)
dev.off()
```

```{R Print residual G4}
#| fig-width: 11
#| fig-height: 8
plot(cubicbf.fits[[4]], shades = TRUE)
```


```{R}
#| results: "hold"
#| label: fig-qqnorm-pop
#| fig-cap: "(A) Density plot of residuals for the four-cluster model. (B) Quantile-quantile plot of residuals for the four-cluster model."
#| fig-width: 6.5
#| fig-height: 6.5
p1 <- data.frame(residuals = resid(cubicbf.fits[[4]])) %>%
  ggplot(aes(x = residuals)) +
  geom_histogram(aes(y = ..density..),
                 fill = "#D8829D",
                 color = "#AF6A80",
                 bins = 30) +
  geom_density(color = "#023777", size = 1.2) +
  theme_classic() +
  theme(axis.line = element_line(colour = "gray")) +
  ylab("Density") +
  xlab("Residuals") +
  ggtitle("A")

p2 <- data.frame(residuals = resid(cubicbf.fits[[4]])) %>%
  ggplot(aes(sample = residuals)) +
    stat_qq_band() +
    stat_qq_line(color = "#D8829D") +
    stat_qq_point(color = "#023777")  +
    theme_classic() +
    theme(axis.line = element_line(colour = "gray")) + 
    xlab("Theoretical Quantiles") +
    ylab("Sample Quantiles") +
  ggtitle("B")
ggsave("paper/Residual-plot.pdf",
       plot = p1 / p2,
       width = 8,
       height = 8,
       units = "in")
print(p1 / p2)
```

```{R}
#| label: fig-qqnorm-cluster
#| fig-cap: "Quantile-quantile plots of residuals for the four-cluster latent class mixed model stratified by (A) cluster 1; (B) cluster 2; (C) cluster 3; and (D) cluster 4."
dict <- cubicbf.fits[[4]]$pprob[, c("id", "class")]
dict$class <- plyr::mapvalues(dict$class,
                              from = c(1, 2, 3, 4),
                              to = c(4, 3, 1, 2))
temp <- merge(cubicbf.fits[[4]]$pred, dict, by = "id")
par(mfrow = c(2,2))
labels <- c("A", "B", "C", "D")
p <- list()

for (i in 1:4) {
  p[[i]] <- subset(temp, class == i) %>%
    ggplot(aes(sample = resid_ss)) +
    stat_qq_band() +
    stat_qq_line(color = "#D8829D") +
    stat_qq_point(color = "#023777")  +
    theme_classic() +
    theme(axis.line = element_line(colour = "gray")) + 
    xlab("Theoretical Quantiles") +
    ylab("Sample Quantiles") +
  ggtitle(labels[i])
}
ggsave("paper/cluster-resids.pdf",
       (p[[1]] + p[[2]]) / (p[[3]] + p[[4]]),
       width = 8,
       height = 8,
       units = "in")
print((p[[1]] + p[[2]]) / (p[[3]] + p[[4]]))
```

#### G = 5

```{R Save residual G5}
#| include: false
png("plots/residuals/g5.png",
    width = 16,
    height = 9,
    units = "in",
    res = 300)
plot(cubicbf.fits[[5]], shades = TRUE)
dev.off()
```

```{R Print residual G5}
#| fig-width: 11
#| fig-height: 8
plot(cubicbf.fits[[5]], shades = TRUE)
```

#### G = 6

```{R Save residual G6}
#| include: false
png("plots/residuals/g6.png",
    width = 16,
    height = 9,
    units = "in",
    res = 300)
plot(cubicbf.fits[[6]], shades = TRUE)
dev.off()
```

```{R Print residual G6}
#| fig-width: 11
#| fig-height: 8
plot(cubicbf.fits[[6]], shades = TRUE)
```

:::

### Spaghetti plots per cluster

Plotting the mean cluster trajectories alongside spaghetti plots of all subject
trajectories provides evidence for the $G = 4$ model being the
most appropriate.  

::: {.panel-tabset}

#### For G = 2

##### Log-scale, all subjects

```{R Spaghetti G2 log all subjects}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative, cubicbf.fits, G = 2, log = TRUE, sizes = TRUE)
```

```{R}
#| include: false
cairo_pdf("paper/sup/all_trajectories_2_log.pdf", width = 8.25, height = 4)
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 2,
               log = TRUE,
               column = FALSE,
               save = TRUE)
dev.off()
```

##### Measurement-scale, all subjects

```{R Spaghetti G2 measurement scale all subjects}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative, cubicbf.fits, G = 2, log = FALSE)
```

##### Log-scale, pprob > 0.8 only

```{R Spaghetti G2 log cutoff}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 2,
               log = TRUE,
               pprob.cutoff = 0.8)
```

##### Measurement-scale, pprob > 0.8 only

```{R Spaghetti G2 measurement scale cutoff}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 2,
               log = FALSE,
               pprob.cutoff = 0.8)
```


#### For G = 3

##### Log-scale, all subjects

```{R Spaghetti G3 log all subjects}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 3,
               log = TRUE,
               mapping = c(1,3,2),
               sizes = TRUE)
```

```{R}
#| include: false
cairo_pdf("paper/sup/all_trajectories_3_log.pdf", width = 8.25, height = 8)
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 3,
               log = TRUE,
               column = FALSE,
               save = TRUE, 
               mapping = c(1,3,2))
dev.off()
```

##### Measurement-scale, all subjects

```{R Spaghetti G3 measurement scale all subjects}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative, cubicbf.fits, G = 3, log = FALSE)
```

##### Log-scale, pprob > 0.8 only

```{R Spaghetti G3 log cutoff}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 3,
               log = TRUE,
               pprob.cutoff = 0.8)
```

##### Measurement-scale, pprob > 0.8 only

```{R Spaghetti G3 measurement scale cutoff}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 3,
               log = FALSE,
               pprob.cutoff = 0.8)
```

#### For G = 4

Class membership has been relabelled to ensure consistency with the alluvial
plot.

##### Log-scale, all subjects

```{R Spaghetti G4 log all subjects}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 4,
               log = TRUE,
               mapping = c(3,2,4,1),
               sizes = TRUE)
```

```{R}
#| include: false
cairo_pdf("paper/all_trajectories_log.pdf", width = 8.25, height = 8)
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 4,
               log = TRUE,
               column = FALSE,
               mapping = c(3, 2, 4, 1),
               save = TRUE)
dev.off()
```

##### Measurement-scale, all subjects

```{R Spaghetti G4 measurement scale all subjects}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 4,
               log = FALSE,
               mapping = c(3, 2, 4, 1))
```

```{R}
#| include: false
cairo_pdf("paper/sup/all_trajectories.pdf", width = 8.25, height = 8)
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 4,
               log = FALSE,
               column = FALSE,
               mapping = c(3, 2, 4, 1), 
               save = TRUE)
dev.off()
```

##### Log-scale, pprob > 0.8 only

```{R Spaghetti G4 log cutoff}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 4,
               log = TRUE,
               pprob.cutoff = 0.8,
               mapping = c(3, 2, 4, 1))
```

##### Measurement-scale, pprob > 0.8 only

```{R Spaghetti G4 measurement scale cutoff}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 4,
               log = FALSE,
               pprob.cutoff = 0.8,
               mapping = c(3, 2, 4, 1))
```

#### For G = 5

##### Log-scale, all subjects

```{R Spaghetti G5 log all subjects}
#| fig-width: 11
#| fig-height: 8
#| message: true
#| warning: false
spaghetti_plot(FCcumulative, cubicbf.fits, G = 5, log = TRUE, sizes = TRUE)
```

```{R}
#| include: false
cairo_pdf("paper/sup/all_trajectories_5_log.pdf", width = 8.25, height = 12)
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 5,
               log = TRUE,
               column = FALSE,
               save = TRUE,
               mapping = c(1, 5, 3, 2, 4)) 
dev.off()
```

##### Measurement-scale, all subjects

```{R Spaghetti G5 measurement scale all subjects}
#| fig-width: 11
#| fig-height: 8
#| warning: false
spaghetti_plot(FCcumulative, cubicbf.fits, G = 5, log = FALSE)
```

```{R}
#| include: false
cairo_pdf("paper/sup/all_trajectories-5.pdf", width = 8.25, height = 8)
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 5,
               log = FALSE,
               column = FALSE,
               mapping = c(1, 5, 3, 2, 4), 
               save = TRUE)
dev.off()
```

##### Log-scale, pprob > 0.8 only

```{R Spaghetti G5 log cutoff}
#| fig-width: 11
#| fig-height: 8
#| warning: false
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 5,
               log = TRUE,
               pprob.cutoff = 0.8)
```

##### Measurement-scale, pprob > 0.8 only

```{R Spaghetti G5 measurement scale cutoff}
#| fig-width: 11
#| fig-height: 8
#| warning: false
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 5,
               log = FALSE,
               pprob.cutoff = 0.8)
```

#### For G = 6

##### Log-scale, all subjects

```{R Spaghetti G6 log all subjects}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative, cubicbf.fits, G = 6, log = TRUE, sizes = TRUE)
```

```{R}
#| include: false
cairo_pdf("paper/sup/all_trajectories_6_log.pdf", width = 8.25, height = 12)
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 6,
               log = TRUE,
               column = FALSE,
               save = TRUE, 
               mapping = c(2, 3, 1, 4, 6, 5))
dev.off()
```

##### Measurement-scale, all subjects

```{R Spaghetti G6 measurement scale all subjects}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative, cubicbf.fits, G = 6, log = FALSE)
```

##### Log-scale, pprob > 0.8 only

```{R Spaghetti G6 log cutoff}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 6,
               log = TRUE,
               pprob.cutoff = 0.8)
```

##### Measurement-scale, pprob > 0.8 only

```{R Spaghetti G6 measurement scale cutoff}
#| fig-width: 11
#| fig-height: 8
spaghetti_plot(FCcumulative,
               cubicbf.fits,
               G = 6,
               log = FALSE,
               pprob.cutoff = 0.8)
```

:::

### Model output

After considering all of the above findings, we deem the model which assumes
four clusters to be the most appropriate. The summary statistics for this model
can be found below. The four spline terms are denoted by
$X_{1}(t), \ldots, X_{4}(t)$, matching the notation used in the supplementary
materials for the paper.
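
To make the notation concrete, the sketch below builds a natural cubic spline
basis with four columns and labels them $X_{1}(t), \ldots, X_{4}(t)$. The time
grid and the use of `df = 4` with default knot placement are assumptions made
for illustration only and need not match the specification of the fitted
models.

```r
library(splines)

# Hypothetical time grid (the real analysis uses observed follow-up times)
t <- seq(0, 5, by = 0.5)

# Natural cubic spline basis with four columns, one per X_k(t)
basis <- ns(t, df = 4)
colnames(basis) <- paste0("X", 1:4, "(t)")

# One row per time point, one column per spline term
dim(basis)
```

Each class-specific trajectory is then a linear combination of an intercept and
these four columns, which is why the summary reports one coefficient per spline
term and class.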

```{R}
### Match notation to supp. materials
x <- cubicbf.fits[[4]]
x$Xnames <- c("Intercept", paste0("X", seq(1, 4), "(t)")) # Random effects
# paste() already separates its arguments with a space, so the labels carry no
# trailing space of their own
names(x$best) <- c(paste("Intercept class", seq(1, 3)), # Class membership model
                   paste("Intercept class", seq(1, 4)), # Longitudinal model
                   paste("X1(t) class", seq(1, 4)),
                   paste("X2(t) class", seq(1, 4)),
                   paste("X3(t) class", seq(1, 4)),
                   paste("X4(t) class", seq(1, 4)),
                   paste("Varcov", seq(1, 15)),
                   "Standard error")
summary(x)
```

## Session information

```{R Session info}
#| echo: false
pander::pander(sessionInfo())
```


## {.appendix}

<div class = "center">
<img class = "center" src="images/MRC_HGU_Edinburgh RGB.png" alt="MRC Human Genetics Unit logo" height = 50px>
<img src="images/cgem-logo.png" alt="Centre for Genomic & Experimental Medicine logo" height = 50px> 
</div>

## Acknowledgments {.appendix}

This work is funded by the Medical Research Council & University of Edinburgh
via a Precision Medicine PhD studentship (MR/N013166/1, to **NC-C**).

## Author contributions {.appendix}

**NC-C** wrote the analysis. **KM** and **CAV** performed code review and
contributed suggestions. **KM**, **RM** and **CAV** provided feedback. 

## Reuse {.appendix}

Licensed under
<a href="https://creativecommons.org/licenses/by/4.0/">CC BY</a>,
except for the MRC Human Genetics Unit, The University of Edinburgh, and Centre for Genomic & Experimental Medicine logos, or where otherwise stated.