CoDa_workflow.Rmd

---
title: CoDa workflows for exploring mineral nutrient compositions of cereal grains
author: M.G. Walsh, R. Rodríguez Iglesias, R. Manners, P.C. Nalivata and M.R. Broadley
date: "Last compiled on `r format(Sys.time(), '%d, %B, %Y')`"
output:
  html_document:
    toc: true
    toc_depth: 2
    fig_caption: true
    css: style.css
---

```{r, echo=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```

# Introduction

Most empirical models of biogeography, species distributions, photosynthesis, nutrient uptake, yield, trophic dynamics, crop management and/or small area dietary requirements assume no interactions between the essential mineral nutrient elements and typically describe the associated processes as a function of only the potentially most limiting parts (i.e., [Liebig's law](https://en.wikipedia.org/wiki/Justus_von_Liebig)). However, *the law of the minimum* does not hold consistently at plant community or ecosystem scales (see e.g., [Danger et al., 2008]( https://doi.org/10.2134/agronj1993.00021962008500030040x)). In agricultural systems, overall mineral nutrient compositions are likely to affect both crop yields e.g., see the recent article by [Aliyu et al., (2021)](https://doi.org/10.1038/s41598-021-95172-7), as well as the overall nutritional quality of the plant products that are subsequently consumed by animals, including humans. 

The objective of this notebook is to provide starter code in [R](https://www.r-project.org/) for exploratory data analyses ([EDA](https://en.wikipedia.org/wiki/Exploratory_data_analysis)) and reporting of the mineral nutrient compositions of the main cereal grains (maize, pearl-millet, rice and sorghum) that are grown in Malawi. The markdown notebook presented here is maintained on [Github](https://github.com/mgwalsh/Bioavailability/blob/master/CoDa_workflow.Rmd), and you can fork and alter it from there for your reference and as you see fit.

*Compositional Data Analysis* (CoDA) refers to the analyses of *compositional data* (CoDa). CoDa are defined as vectors with strictly positive components whose sum is constant (e.g, 1 or 100%). The terms also cover all other *parts* of a whole (so-called *subcompositions*), which carry only relative information ([van den Boogaart and Tolosana-Delgado, 2013](https://link.springer.com/book/10.1007/978-3-642-36809-7)). Because a change in any proportion of a whole changes at least one other proportion, CoDa are interdependent. Anything that is part of a whole is compositional. So arguably, all [ionomic data](https://academic.oup.com/plphys/article/136/1/2451/6112383) should be treated as CoDa because as a whole the concentrations must sum to the respective unit totals of the constituent parts i.e., in the case of plants and animals as compositions of water, organic and inorganic materials. In turn, the organic, and in this case specifically the inorganic nutrient subcompositions considered here, should also not be analyzed and interpreted without normalizing them to one another for downstream analyses and diagnostics (see e.g., [Parent et al., 2013](https://doi.org/10.3389/fpls.2013.00039)). 

The original approach to CoDA developed by [J. Aitchison (1982)](https://rss.onlinelibrary.wiley.com/doi/10.1111/j.2517-6161.1982.tb01195.x) uses logarithmic ratios (LRs) of the compositional parts as the fundamental starting point for CoDa description and modeling. There is also a recent CoDa review paper by [M. Greenacre (2021)](https://www.annualreviews.org/doi/pdf/10.1146/annurev-statistics-042720-124436) openly available, which provides excellent guidance, as well as [R scripts](https://github.com/michaelgreenacre/CODAinPractice), about the practical applications and interpretation of CoDa. There is also an example, open source article available that treats genomic survey data in an analogous fashion and uses some of the graphical modeling techniques that we will be illustrating here (see [Friedman and Alm, (2012)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002687), and the references therein).

# General data setup

The data we will be using were generated by the [GeoNutrition project](http://www.geonutrition.com) in Malawi. They consist of 1,809 cereal grain samples that were collected for maize (1,606), millet (31), rice (54), and sorghum (117) in proportion of their national geographical occurrence. The nationally representative cropland sampling frame and the laboratory methods that were used in the data collections are described in [Gashu et al., (2021)](https://doi.org/10.1038/s41586-021-03559-3). 

The data include [ICP-MS](https://en.wikipedia.org/wiki/Inductively_coupled_plasma_mass_spectrometry) concentration measurements of the respective cereal mineral nutrient contents (in mg kg^-1^), including: Na, Mg, P, S, K, Ca, Cr, Mn, Fe, Co, Cu, Zn, Se, Mo, and I. They also include the georeference of where the samples were collected (lon/lat, EPSG:3857), the cereal crop at that location, the source of the grain samples (crop, field stack or store), if or not, fertilizer and/or lime were used in growing the crop, and basic soil properties (pH, soil organic matter and cation exchange capacity). Note that any crop nutrient measurements below their respective limits of detection (>LOD) have been set to 2/3 of the respective minimum observed values. Other imputation methods are also possible in this context see [Greenacre (2021)](https://www.annualreviews.org/doi/pdf/10.1146/annurev-statistics-042720-124436).

## Packages and data download

To actually run this notebook, you will need to load the packages indicated in the chunk directly below to download and assemble the data. The chunk (not shown but in the markdown document) also generates an overview map of where the samples were collected in Malawi ... you can click and zoom into the individual sampling locations.

```{r, results = 'hide'}
# Package names
packages <- c("osfr", "dplyr", "leaflet", "solitude", "compositions", "doParallel", "caret", "caretEnsemble", "MASS", "randomForest", "xgboost", "nnet", "klaR", "stargazer", "archetypes", "arm")

# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))
```

There will be additional packages that will be need to be installed to run the graphical models presented in a later section of this notebook, which require more customized procedures for installation depending on the R version that you are running. We'll deal with those in the relevant sections below. This next chunk downloads the Malawi data.

```{r, results = 'hide'}
# Create a data folder in your current working directory
dir.create("MW_coda", showWarnings=F)
setwd("./MW_coda")
dir.create("Results")

# Crop data download from OSF at: https://osf.io/btxyc/
osf_retrieve_file("btxyc") %>% osf_download(conflicts = "overwrite")
codat <- read.table("MW_grain_CoDa.csv", header=T, sep=",")

# Soil data download from OSF at: https://osf.io/5tgu3/
osf_retrieve_file("5tgu3") %>% osf_download(conflicts = "overwrite")
soils <- read.table("soil_props.csv", header=T, sep=",")

# Merge crop and soil data
codat <- merge(codat, soils, by = "CropSampleID")

# Download figures at: at: https://osf.io/pnw7g/
osf_retrieve_file("pnw7g") %>% osf_download(conflicts = "overwrite")
```

```{r, echo = FALSE}
# Grain sample locations
w <- leaflet() %>%
  setView(lng = mean(codat$lon), lat = mean(codat$lat), zoom = 7) %>%
  addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
  addCircleMarkers(codat$lon, codat$lat, clusterOptions = markerClusterOptions())
w ## plot widget 
```

## CoDa transforms

This section of the notebook sets up a CoDa data frame that includes 2 amalgamations: mineral **macro-nutrients** (i.e., Na, Mg, P, S, K, and Ca ... in atomic number order), and **micro-nutrients** (i.e., Cr, Mn, Fe, Co, Cu, Zn, Se, Mo, and I). Note that the distinction between *macro* and *micro* here is only intended to reflect the recommended daily dietary values ([DDV](https://www.fda.gov/food/new-nutrition-facts-label/daily-value-new-nutrition-and-supplement-facts-labels)) that should be consumed by people, and not their overall nutritional importance and/or benefits.

This next chunk [closes](https://econ-papers.upf.edu/papers/1554.pdf) the CoDa, and calculates *centered logratios* ([CLR](https://www.rdocumentation.org/packages/compositions/versions/2.0-1/topics/clr)). These can then be analyzed, summarized and interpreted with conventional univariate, multivariate, and data modeling methods ([Greenacre, 2021](https://www.annualreviews.org/doi/pdf/10.1146/annurev-statistics-042720-124436)). Note that you could also use other logratios such as the [ALR](https://www.rdocumentation.org/packages/compositions/versions/2.0-1/topics/alr), or the [ILR](https://www.rdocumentation.org/packages/compositions/versions/2.0-1/topics/ilr) and/or different amalgamations of subcompositions, but we shall leave those possibilities for you to explore. We also write out the resulting data frame to your `./MW_coda/Results` directory should you wish to process it in software other than R.

```{r}
# Calculate the CLRs
varc <- c("Na", "Mg", "P", "S", "K", "Ca", "Cr", "Mn", "Fe", "Co", "Cu", "Zn", "Se", "Mo", "I")
combo <- codat[varc]
combo <- combo / rowSums(combo) ## close the composition
combo <- as.data.frame(clr(combo)) ## centered log ratio (clr) transform

# Define the (macro / micro) amalgamation and add it to the overall nutrient composition
varm <- c("Na", "Mg", "P", "S", "K", "Ca")
macro <- rowSums(combo[varm])
combo <- cbind(combo, macro)

# Attach site data info
locv <- c("CropSampleID", "SoilSampleID", "lat", "lon", "crop", "source", "fert", "pH", "soc", "eCEC") 
locs <- codat[locv]
site <- cbind(locs, combo)

# Write file
write.csv(site, "./MW_coda/Results/MW_site_coda.csv", row.names = F)
```

You can plot the raw CLR differences between crop species with something like in this next ugly chunk ...

```{r, fig.align = "center", fig.show = 'hold', fig.height = 12, fig.width = 7.2, fig.cap = "Boxplots of raw nutrient CLR differences between crop species."}

par(mfrow = c(5,3), pty="s", mar=c(2,4,1,1))
boxplot(site$Na ~ site$crop, notch = T, ylab = "Na", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Mg ~ site$crop, notch = T, ylab = "Mg", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$P ~ site$crop, notch = T, ylab = "P", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$S ~ site$crop, notch = T, ylab = "S", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$K ~ site$crop, notch = T, ylab = "K", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Ca ~ site$crop, notch = T, ylab = "Ca", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Cr ~ site$crop, notch = T, ylab = "Cr", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Mn ~ site$crop, notch = T, ylab = "Mn", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Fe ~ site$crop, notch = T, ylab = "Fe", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Co ~ site$crop, notch = T, ylab = "Co", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Cu ~ site$crop, notch = T, ylab = "Cu", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Zn ~ site$crop, notch = T, ylab = "Zn", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Se ~ site$crop, notch = T, ylab = "Se", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Mo ~ site$crop, notch = T, ylab = "Mo", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$I ~ site$crop, notch = T, ylab = "I", xlab = "", cex.axis = 1, cex.lab = 1.3)
```

# Compositional outlier detection

Note that there are quite a few outliers apparent in the boxplot figure above. This short section demonstrates an unspervised anomaly and outlier detection method for CoDa (and other multivariate data, see [Liu, Ting and Zhou (2008)](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/tkdd11.pdf). We will be using the `isolationForest` function from the [`solitude`](https://www.rdocumentation.org/packages/solitude/versions/1.1.1) package here, with default tuning parameters. We also plan to expand on anomaly and outlier detection more in subsequent notebooks, so this next chunk should be considered as an initial draft.

```{r, results = 'hide'}
# Select CoDa variables
coda <- site[ ,11:25]

# Fit isolationForest
iso <- isolationForest$new()
iso$fit(coda)
score <- coda %>% iso$predict()

# Tag outliers based on quantile threshold
thresh <- as.numeric(quantile(score$anomaly_score, probs = 0.95)) ## you can change the quantile level here
coda$outlier <- as.factor(ifelse(score$anomaly_score >= thresh, "y", "n"))
site <- cbind(site, coda$outlier)
names(site)[names(site)=="coda$outlier"] <- "outlier"
```

At the specified 95% quantile threshold of the modeled `anomaly_score` this identifies 91 potential CoDa sample outliers that we will use in some of the subsequent EDAs in this notebook e.g., see the provenance sample tagging section further below.

# Compositional archetypes and unsupervised CoDa classification

*Archetypal analysis* ([Cutler and Breiman, 1994](https://www.stat.cmu.edu/technometrics/90-00/vol-36-04/v3604338.pdf)) is an unsupervised learning method similar to e.g., [K-means clustering](https://towardsdatascience.com/k-means-clustering-explained-4528df86a120). It describes each observation in a dataset as a mixture of archetypes. Rather than describing average or typical observations and measurements, archetypal analysis seeks to identify extreme points on the [convex hull](https://en.wikipedia.org/wiki/Convex_hull) of a data matrix, which can provide a simple and quite useful way of clustering and looking at multivariate data ([Eugster and Leisch, 2009](https://cran.r-project.org/web//packages/archetypes/vignettes/archetypes.pdf)). The identified archetypes themselves are mixtures of the individual CoDA variables under consideration. Archetypal mixture coefficients and projections are also CoDa, which is why we include them in this notebook. Note that other clustering and ordination algorithms could also be used in this context ([Greenacre, 2021](https://www.annualreviews.org/doi/pdf/10.1146/annurev-statistics-042720-124436)).

## Identifying archetypes

This next chunk fits the Malawi CLR data using the [`archetypes`](https://cran.r-project.org/web//packages/archetypes/vignettes/archetypes.pdf) package, which is a robust implementation of the original Cutler and Breiman (1994) algorithm that uses [k-fold validation](https://www.statology.org/k-fold-cross-validation/).

```{r, output = 'hide'}
# Fit nutrient archetypes with 10-fold cross validation
set.seed(85321)
comp.arc <- stepArchetypes(data=site[,11:25], k = 1:10, nrep = 10, verbose = FALSE)
```

```{r, echo = FALSE, fig.align = "center", fig.height = 4, fig.width = 4, fig.cap = "Archetype scree-plot showing the no. of archetypes vs the multivariate residual sums of squares (RSS)."}

par(pty="s", mar=c(4,4,1,1))
screeplot(comp.arc)
```

## Archetypal classification

Similar to other clustering and ordination algorithms, we can use the identified archetypes for dimensional reduction and unsupervised classification.

```{r}
# Select no. of archetypes, the scree plot above suggests 3-7 ... we use 4 here
comp.arc4 <- bestModel(comp.arc[[4]])

# Classify CLRs by "dominant archetypes" (DA)
darc <- as.data.frame(predict(comp.arc4, site[,11:25]))
darc$DA <- apply(darc, 1, which.max)
colnames(darc) <- c("A1","A2","A3","A4","DA")
site <- cbind(site, darc)

# Overwrite the site file
write.csv(site, "./MW_coda/Results/MW_site_coda.csv", row.names = F)
```

As before, you can plot the raw CLR differences between dominant archetypes as: (code not shown, but included in the markdown).

```{r, echo = FALSE, fig.align = "center", fig.show = 'hold', fig.height = 12, fig.width = 7.2, fig.cap = "Boxplots of nutrient CLR differences between dominant archetypes."}

par(mfrow = c(5,3), pty="s", mar=c(2,4,1,1))
boxplot(site$Na ~ site$DA, notch = T, ylab = "Na", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Mg ~ site$DA, notch = T, ylab = "Mg", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$P ~ site$DA, notch = T, ylab = "P", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$S ~ site$DA, notch = T, ylab = "S", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$K ~ site$DA, notch = T, ylab = "K", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Ca ~ site$DA, notch = T, ylab = "Ca", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Cr ~ site$DA, notch = T, ylab = "Cr", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Mn ~ site$DA, notch = T, ylab = "Mn", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Fe ~ site$DA, notch = T, ylab = "Fe", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Co ~ site$DA, notch = T, ylab = "Co", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Cu ~ site$DA, notch = T, ylab = "Cu", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Zn ~ site$DA, notch = T, ylab = "Zn", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Se ~ site$DA, notch = T, ylab = "Se", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$Mo ~ site$DA, notch = T, ylab = "Mo", xlab = "", cex.axis = 1, cex.lab = 1.3)
boxplot(site$I ~ site$DA, notch = T, ylab = "I", xlab = "", cex.axis = 1, cex.lab = 1.3)
```

This is an aside, but it gives you an idea of what the "*most*" dominant archetypal sites look like from a [GeoSurvey](https://osf.io/vxc97/) perspective in Malawi. This should remind us of the fact that there are probably quite explicit spatial links to assessing nutrient compositions that should be explored. However, those links are certainly **not plainly visible in satellite images**, apart from the co-occurrence of croplands, buildings/settlements and their potential anthropic effects apparent in these examples.

```{r, echo=FALSE, fig.align="center", fig.cap="Dominant archetypal grain nutrient composition sites in Malawi visualized in GeoSurvey.", out.width = '90%'}
knitr::include_graphics("./MW_coda/MW_archetypes.png")
```

However, we do expect that some form dimensional CoDa reduction technique will help with some of the subsequent spectral and spatial modeling goals for assessing or monitoring national grain nutrient compositions, and we will follow-up those propositions with additional notebooks.

# Sample provenance tagging and tracking with supervised learning

The main reason for including this section in notebook is that where and from which sources the grain samples were obtained i.e., their *"provenance"* may be important. Provenance could make a difference as to how those respective data should be interpreted for diagnostics and monitoring. In this survey dataset the grain samples were collected either directly from the standing crop, from harvested crop stacks located in the respective fields, or from household grain stores, on both fertilized and unfertilized fields. Both household grain store and fertilized fields are problematic in this context in that we don't actually know what happened after samples were removed from the field, mixed and stored from potentially different farm fields. We also cannot assess what the actual effect of the types and amounts of fertilizer and/or lime that were used were because explicit georeference and sample age links to the field data are missing. The combination of these factors could influence reported survey results e.g., via background contamination and noise of the obtained grain survey samples, and any other overall sample survey artifacts or treatment differences.

## Sample provenance tagging

The next chunks provides an example of sample provenance tagging and tracking. We differentiate reference samples that were obtained directly from the linked grain and soil observations (crop or stack) on unfertilized fields, which are not CoDa outliers from all other survey samples (i.e., those stored and/or from fertilized samples and/or outliers). Note that you can easily change and/or expand the rule(s) for this. However you define *"reference samples"*, these should be your highest-value targets in future surveys because they define a consistent field baseline for monitoring any diagnostic changes. Also note that these landmark targets are hard to collect in the field because they occur only, and have to be sampled, during a quite narrow time window during any given cropping season. This next chunk tags the reference samples and produces a sketch map of where those samples were located in the currently available survey.

```{r}
# Tag reference samples with <dplyr>
site <- site %>% mutate(ref = case_when(source == 'crop' | source == 'stack' & outlier == 'n'
                                        & fert == 'n' ~ 'y', TRUE ~ 'n'))

# Reference sample/site file write out
write.csv(site, "./MW_coda/Results/MW_ref_coda.csv", row.names = F)

# Reference sample location map
rloc <- site[ which(site$ref == 'y'), ]

r <- leaflet() %>%
  setView(lng = mean(rloc$lon), lat = mean(rloc$lat), zoom = 7) %>%
  addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
  addCircleMarkers(rloc$lon, rloc$lat, clusterOptions = markerClusterOptions())
r ## plot widget 
```

## Sample provenance tracking

You could track and monitor your reference samples with the following short, exploratory machine learning workflow. We initially split the data into calibration and validation sets and select the relevant features for distinguishing between `ref = y` or `ref = n` samples. This could also be readily updated with new data as (i.e., from new locations and/or from repeat samples), become available over time.

```{r}
# Set calibration/validation set randomization seed
seed <- 123581
set.seed(seed)

# Split data into calibration and validation sets
gsIndex <- createDataPartition(site$ref, p = 4/5, list = F, times = 1)
gs_cal <- site[ gsIndex,]
gs_val <- site[-gsIndex,]

# Specify sample calibration labels
labs <- c("ref")
lcal <- as.vector(t(gs_cal[labs]))

# Specify calibration features
fcal <- gs_cal[ ,c(3:4,11:25)] ## selects lat/lon & CoDa features
```

The following `caretEnsemble` chunk then generates the reference sample / site predictions on the validation set with default-tuning of the relevant model [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)). Note that if you would like more control over the individual model fits, you can also use their respective `tuneList` arguments.

```{r, warning = FALSE, results='hide'}
# Start doParallel to parallelize model fitting
seed <- 123581
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Specify model training controls
tc <- trainControl(method = "cv", number = 10, classProbs = T,
                   summaryFunction = twoClassSummary, allowParallel = TRUE,
                   savePredictions = "final")

# Fit 5 calibration models using all of the CoDa features
clist <- caretList(fcal, lcal,
                   trControl = tc,
                   tuneList = NULL,
                   methodList = c("glmStepAIC", "rf", "xgbTree", "nnet", "nb"),
                   metric = "ROC")

gs_val$gl <- predict(clist$glmStepAIC, gs_val, type = 'prob')$y
gs_val$rf <- predict(clist$rf, gs_val, type = 'prob')$y
gs_val$xt <- predict(clist$xgbTree, gs_val, type = 'prob')$y
gs_val$nn <- predict(clist$nnet, gs_val, type = 'prob')$y
gs_val$nb <- predict(clist$nb, gs_val, type = 'prob')$y
stopCluster(mc)
fname <- paste("./MW_coda/Results/", labs, "_clist.rds", sep = "")
saveRDS(clist, fname)
```

The next chunk then stacks the 5 previous models, the so-called [base learners](https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205), on the validation set. We will use `glmStepAIC` function for model selection and weighting, but you may want to try other [top learners](https://en.wikipedia.org/wiki/Ensemble_learning) in this context as well.

```{r, results = 'hide'}
# Set validation labels and features
lval <- as.vector(t(gs_val[labs])) ##  subset validation labels
fval <- gs_val[ ,34:38] ## subset validation features

# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Control setup
set.seed(1385321)
tc <- trainControl(method = "cv", number = 10, classProbs = T,
                   summaryFunction = twoClassSummary, allowParallel = TRUE,
                   savePredictions = "final")

# Model training
st <- train(fval, lval,
            method = "glmStepAIC",
            family = "binomial",
            metric = "ROC",
            trControl = tc)

# Model outputs & predictions
gs_val$st <- predict(st, gs_val, type = 'prob')$y ## stacked predictions on the validation set
stopCluster(mc)
fname <- paste("./MW_coda/Results/", labs, "_st.rds", sep = "")
saveRDS(st, fname)
```

The next chunks generate the stacked predictions (`st`) on the full `site` dataset and compares those to the currently available, field-based provenance tags.

```{r}
# Generate predictions on overall dataset
site$gl <- predict(clist$glmStepAIC, site, type = 'prob')$y
site$rf <- predict(clist$rf, site, type = 'prob')$y
site$xt <- predict(clist$xgbTree, site, type = 'prob')$y
site$nn <- predict(clist$nnet, site, type = 'prob')$y
site$nb <- predict(clist$nb, site, type = 'prob')$y
site$st <- predict(st, site)
```

```{r}
# Confusion matrix
confusionMatrix(site$st, as.factor(site$ref))
```

The fact that we can readily differentiate `ref = y` from `ref = n` samples with high [sensitivity and specificity](https://en.wikipedia.org/wiki/Sensitivity_and_specificity) in this survey provides a good indication that we should probably be dealing with at least two CoDa measurement sub-populations based on their provenance tags (i.e., field vs store from either fertilized or unfertilized fields). Given the sample geographical locations and their associated CoDa, the samples are clearly compositionally and/or geographically separable. Note that these differences were not considered in the [Gashu et al., (2021)](https://doi.org/10.1038/s41586-021-03559-3) article, and may have distorted the Malawi results presented therein without regard to any of the underlying survey artifacts.

# Univariate survey post-stratification

This section provides initial univariate data model summaries of parts of the CLR transformed data by crop species. We focus on the contrast of the macro / micro nutrient subcompositions between crops. This can be readily extended to other compositional amalgamations. These next chunks fit the differences between the individual macro nutrient CLRs of millet, rice and sorghum as compared to maize, post-stratified for sample provenance (`ref`) and dominant archetype (`DA`). We will be doing this using mixed-effects (random intercept) regression with the [`arm`](https://www.rdocumentation.org/packages/arm/versions/1.11-2) package.

```{r, results = 'asis'}
# Fit nutrients by crop species with mixed-effects regression
m0 <- lmer(macro~crop + (1|ref) + (1|DA), data = site)
m1 <- lmer(Na~crop + (1|ref) + (1|DA), data = site)
m2 <- lmer(Mg~crop + (1|ref) + (1|DA), data = site)
m3 <- lmer(P~crop + (1|ref) + (1|DA), data = site)
m4 <- lmer(S~crop + (1|ref) + (1|DA), data = site)
m5 <- lmer(K~crop + (1|ref) + (1|DA), data = site)
m6 <- lmer(Ca~crop + (1|ref) + (1|DA), data = site)

# Generate regression summary table
stargazer::stargazer(m0, m1, m2, m3, m4, m5, m6, type = "html", 
                     ci = TRUE, ci.level = 0.95, intercept.bottom = FALSE, digits = 1, 
                     omit.table.layout = "sn", model.numbers = FALSE,
                     column.sep.width = "24pt",
                     covariate.labels = c("Intercept","millet","rice","sorghum"),
                     title = "Macro-nutrient regression summary using maize as the reference crop.")
```
\

Based on this, there appear to be clear differences between the macro compositional parts of the 4 cereal crops covered by this survey. Maize appears to have a higher macro-nutrient proportion than millet, rice or sorghum. Conversely, because of the binary macro / micro contrast that is imposed here, the proportion of total micro-minerals would be expected to be proportionally higher in millet, rice and sorghum as compared to maize. The corresponding micro-nutrient summary is shown in the table below.

```{r, echo = FALSE, results = 'asis'}
# Fit macro-nutrients by crop species and fertilizer use
m7 <- lmer(Cr~crop + (1|ref) + (1|DA), data = site)
m8 <- lmer(Mn~crop + (1|ref) + (1|DA), data = site)
m9 <- lmer(Fe~crop + (1|ref) + (1|DA), data = site)
m10 <- lmer(Co~crop + (1|ref) + (1|DA), data = site)
m11 <- lmer(Cu~crop + (1|ref) + (1|DA), data = site)
m12 <- lmer(Zn~crop + (1|ref) + (1|DA), data = site)
m13 <- lmer(Se~crop + (1|ref) + (1|DA), data = site)
m14 <- lmer(Mo~crop + (1|ref) + (1|DA), data = site)
m15 <- lmer(I~crop + (1|ref) + (1|DA), data = site)

# Generate regression summary table
stargazer::stargazer(m7, m8, m9, m10, m11, m12, m13, m14, m15, type = "html", 
                     ci = TRUE, ci.level = 0.95, intercept.bottom = FALSE, digits = 1, 
                     omit.table.layout = "sn", model.numbers = FALSE,
                     column.sep.width = "24pt",
                     covariate.labels = c("Intercept","millet","rice","sorghum"),
                     title = "Micro-nutrient regression summary using maize as the reference crop.")
```
\

# Graphical models of reference sample nutrient interactions

So far, we have largely been dealing with univariate CoDa models, as is common practice. The aim of this next section is to illustrate an empirical multivariate basis that describes the relevant conditional relationships between the different CoDa parts with graphical interaction models. The types of models that we will be using are well described and illustrated in [Højsgaard, Edwards and Lauritzen (2012)](https://link.springer.com/book/10.1007/978-1-4614-2299-0) including the code and references therein. There is also an open R tutorial by [Højsgaard (2012)](https://people.math.aau.dk/~sorenh/misc/2012-Oslo-GMwR/GMwR-notes.pdf) available, which you might want to consult.

## Package installs

To run this section of the notebook you will initially need to install the relevant packages. This is a bit of an art form depending on the version of R that you are currently running. Both the [`gRim`](https://cran.r-project.org/web/packages/gRim/index.html) and [`statnet`](https://statnet.org/) packages that we shall be using for these analyses and their visualization have dependencies to other packages e.g., [`bioconductor`](http://www.bioconductor.org/install/), which can be a bit tricky depending on their versioning cycle on [CRAN](https://cran.r-project.org/). The next chunks are what worked for us using our current `R.Version()`, but this is likely to change and you may need to adjust accordingly depending on your setup.

```{r, results = 'hide'}
# Install <bioconductor> first ... see: http://www.bioconductor.org/install/
# do this only once
# if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
# BiocManager::install(version = "3.13")

# note that this will query you in the console as to whether to update other associated packages ... (a/s/n). Reply with n if all of the listed packages are up to date.

# Then install and load gRbase, gRim and statnet 
packages <- c("gRbase", "gRim", "statnet")

# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))
```

Similarly, the [`statnet`](https://www.rdocumentation.org/packages/statnet/versions/2019.6) package, which we will use for graphical model visualization currently may have some specific installation requirements that you can track on their [Github repository](https://statnet.org/).

## Fitting graphical interaction models

The first graphical model that we fit provides a baseline for describing nutrient interactions by crop species for provenance reference (`ref = y`) sites. We start with a model that initially contains no interactions and then adds only the most significant ones sequentially (i.e., with [Foward selection](https://jmlr.csail.mit.edu/papers/volume20/17-334/17-334.pdf)).

```{r, results = 'hide'}
# Select reference sites
refs <- site[ which(site$ref=='y'), ]

# CoDa interactions by crop type
coda <- site[ ,c(5,11:25)]

# Forward selection
mm1 <- mmod(~.^1, data = coda)
mm2 <- forward(mm1)
mmf2 <- ugList(terms(mm2), result="matrix")
net <- as.network(x = mmf2, directed = FALSE, loops = FALSE, matrix.type = "adjacency")
```

This next chunk produces the associated interaction graph. Note that this represents the maximal interaction graph for the reference data.

```{r, fig.align = "center", fig.height = 4, fig.width = 4, fig.cap = "Baseline CoDa nutrient interaction graph by crop species for reference samples."}

par(mar=c(1,1,1,1))
plot.network(net, vertex.col = "light grey", vertex.cex = 6, displaylabels = T, label.pos = 5, mode = "circle")
```

You can also obtain the [Maximal cliques](https://www.sciencedirect.com/topics/mathematics/maximal-clique) of this graph. In [Graph theory](https://en.wikipedia.org/wiki/Graph_theory) and in statistics, a clique is a complete subset of a graph, and a maximal clique is a clique which cannot be enlarged. In this particular case, *"maximal cliques"* can be interpreted as specific grain nutrient subcompositions that may warrant further exploration for any subsequent diagnostic modeling.

```{r}
getCliques(mmf2)
```

We can fit a similar model by separating the interactions by macro and micro-nutrient sub-compositions, by crop species. The `mm3` base model specification sets that up. We can then use [Backward selection](https://jmlr.csail.mit.edu/papers/volume20/17-334/17-334.pdf) to retain only the most significant interactions in the corresponding `mm4` fit, which also produces the associated graph.

```{r, results = 'hide'}
# Marco and micro-nutrient interactions by crop type
mm3 <- mmod(~crop:Na:Mg:P:S:K:Ca+crop:Cr:Mn:Fe:Co:Cu:Zn:Se:Mo:I, data = coda)
mm4 <- backward(mm3)
mmf4 <- ugList(terms(mm4), result="matrix")
net <- as.network(x = mmf4, directed = FALSE, loops = FALSE, matrix.type = "adjacency")
```

```{r, echo = FALSE, fig.align = "center", fig.height = 4, fig.width = 4, fig.cap = "CoDa macro and micro-nutrient interaction graph by crop species for reference samples."}

par(mar=c(1,1,1,1))
plot.network(net, vertex.col = "light grey", vertex.cex = 6, displaylabels = T, label.pos = 5, mode = "circle")
```

Under the given model assumptions, the corresponding maximal cliques (sub-compositions) are identified as:

```{r, echo = FALSE}
getCliques(mmf4)
```

Note that this halves the number of sub-compositions that might need to be considered relative to the baseline (`m2`) model, and therefore could be considered as a minimal CoDa interaction model relative to the main cereal crops that are reported in Malawi.

# Takeways

The main takeaways from this notebook are the following:

* Converting nutrient concentrations to multivariate logratios is a simple step, but unlike other data transformations it preserves the integrity of the relevant compositional parts that are needed to normalize the data for most subsequent univariate, multivariate and/or inferential analyses and the associated statistics. Also, CoDA estimates can always be back-transformed into their original units given total concentration sums.

* We think that it may be more informative to analyze mineral nutrient concentration data as CoDa. The associated LR conversions (ALR, CLR, ILR) do not make EDAs any more computationally difficult or time consuming, but they should improve any subsequent data modeling, geostatistical and/or machine learning based predictions in instances where subcompositional closure principles (*sensu* [Aitchison geometry](https://laboratoriomatematicas.uniandes.edu.co/cursocoda/04-Vera-geometry.pdf)) apply.

* Automated, multivariate anomaly and outlier detection is advancing with the application of [Isolation principles](https://ieeexplore.ieee.org/document/4781136) to machine learning. We shall explore this topic further in additional notebooks.

* Clustering the CoDa nutrient data in fewer compositional dimensions e.g., via archetypal analysis, provides a simple framework and generalization for post-stratifying nutrient survey results that should improve the precision and accuracy of national cereal grain mineral nutrient estimates in Malawi. They also represent an alternate means for closing and normalizing CoDa. Other clustering, standard [Ordination techniques](https://en.wikipedia.org/wiki/Ordination_(statistics)) and genotype × environment interaction ([GGE](https://cran.r-project.org/web/packages/gge/gge.pdf)) approaches could and should also be explored in this context (see [Greenacre, 2021](https://www.annualreviews.org/doi/pdf/10.1146/annurev-statistics-042720-124436)).

* Reference sample tagging and tracking can help with locating where and under which circumstances the CoDa were obtained, and identifies high-value targets for future surveys (*sensu*, their [provenance](https://en.wikipedia.org/wiki/Provenance)). This also broaches the subject of generating deployment-ready, spectral and/or spatial nutrient predictions using machine learning and/or geostatistics. These additional topics will be covered in more detail in subsequent notebooks. The main lesson is to pay close(r) attention to the collection and interpretation of *provenance information* in future surveys!

* Graphical interaction models can help to clarify the conditional relationships between compositional parts conditioned on observations and measurements. They can also be used to select, compositionally close, model and compare different sub-compositions, including those guided by domain experts. This is **our preferred EDA technique** for developing e.g., reproducible soil and/or spatial diagnostics of potential nutrient deficiencies ... and data permitting.

Any questions, comments or corrections about this notebook are most welcome via [AFSIS](mailto:mgw.africasoils.info).