Spatial_preds_crop_ionomes.Rmd

---
title: Spatial predictions of cereal grain ionomes from Ethiopia and Malawi
author: M.G. Walsh, D.B. Kumssa, A.R. Mossa, T. Amede, P.C. Nalivata, D. Gashu, S.M. Haefele, S. Gameda, M.R. Broadley, E.J.M. Joy, E.H. Bailey, R.M. Lark, B.A. Walsh, R. Rodríguez Iglesias, J. Chamberlin and S. Snapp
date: "`r format(Sys.time(), '%d, %B, %Y')`"
output:
  html_document:
    toc: true
    toc_depth: 2
    fig_caption: true
    keep_md: true
    number_sections: true
    css: style1.css
---

```{r setup, echo = FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```

# Introduction

Models of trophic dynamics in spatial ecology generally assume no interactions between the essential nutrient elements and describe the associated transfer processes as a function of only the potentially most limiting parts (i.e., [Liebig's law](https://en.wikipedia.org/wiki/Justus_von_Liebig)). However, the "*Law of the Minimum*" does not hold consistently at landscape scales (e.g., [Sinclair and Park, 1993](https://doi.org/10.2134/agronj1993.00021962008500030040x)), nor at different [trophic levels](https://en.wikipedia.org/wiki/Trophic_level). Ecological theory has not yet been able to accurately characterize and predict the underlying spatial nutrient transfer relationships, and their associated fluxes across food system boundaries. So, understanding the statistical nature of both survey and experimental results remain important undertakings for managing crop, animal, and human nutrition, as well as for the conservation of essentially [non-renewable](https://en.wikipedia.org/wiki/Non-renewable_resource) resources such as productive soils and potable water.

<br>
```{r, echo=FALSE, fig.align="center", fig.cap="**Fig. 1**: The main trophic pathways in agricultural systems within a pixel or voxel. Note that the dashed arrow (from Food crops to Livestock) flags one potentially pathological trophic pathway that can occur in some food systems.", out.width = '60%'}

knitr::include_graphics("trophic_levels.jpg")
```

[Ionomics](https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/ionomics) is the study of the *ionome* that encompasses the quantitative and simultaneous measurement of the elemental composition of a particular organism or its tissues (see [Salt, Baxter and Lahner, 2008](https://www.annualreviews.org/doi/pdf/10.1146/annurev.arplant.59.032607.092942)). In agricultural systems, ionomes are associated with both crop yields as well as the overall nutritional quality of the plant products that are subsequently consumed by animals, including humans. "*What this means is that for people consuming food sourced locally, as is the case for many smallholder farming communities, the location of residence will be a major, sometimes the largest, influencing factor in determining the dietary intake of micronutrients from cereals* ([Kumssa et al., 2022](https://www.nature.com/articles/s41597-022-01500-5))."

<br>
```{r, echo=FALSE, fig.align="center", fig.cap="**Fig. 2**: Are supplements readily available in your neighborhood? ... in rural Malawi and Ethiopia they are generally not! Also note the big list of ingredients that go into the supplement, which can generally not be produced locally in Africa.", out.width = '40%'}

knitr::include_graphics("supplement.jpg")
```

[Compositional data](https://doi.org/10.3389/fpls.2013.00039) profiles such as those shown in Fig. 2 present a special class of multivariate [labeling](https://en.wikipedia.org/wiki/Labeled_data) problems for data mining, machine learning, prediction, monitoring and [recommender systems](https://en.wikipedia.org/wiki/Recommender_system) because they contain multiple labels that may be interdependent. Another important characteristic of ionomic data is that they are usually defined as vectors with strictly positive components whose sum is constant (e.g, 1 or 100%). Both these characteristics are solvable from a machine learning perspective, but they have not been adequately tested in terms of their potential influence in describing and predicting the associated soil-crop trophic interactions and their potential nutritional outcomes for people (and livestock).

This notebook examines spatial predictions associated with mineral nutrient deficiency risks for humans using cereal grain data from Ethiopia and Malawi. The data we shall be using were generated by the [GeoNutrition Project](http://www.geonutrition.com). Nationally representative cropland sampling frames and the laboratory methods that were used in the data collections are well described in [Gashu et al., (2021)](https://doi.org/10.1038/s41586-021-03559-3). An overall data descriptor article provides additional information and the raw data downloads at: [Kumssa et al., (2022) ](https://www.nature.com/articles/s41597-022-01500-5). The notebook itself is maintained on GitHub ([here](https://github.com/mgwalsh/GeoNutrition/blob/main/Spatial_preds_crop_ionomes.Rmd)). Feel free to alter and use it as you see fit, but also please cite it when you do so. It is copyrighted as [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

# Data setup

To actually run this notebook, you will need to load the packages indicated in the chunk below.

```{r, results = 'hide'}
# Package names
packages <- c("tidyverse", "rgdal", "sp", "raster", "leaflet", "htmlwidgets", "DT", "compositions",
              "doParallel", "caret", "caretEnsemble", "quantreg")

# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))
```

The primary data used in this notebook include *Inductively Coupled Plasma Mass Spectrometry* ([ICP-MS](https://en.wikipedia.org/wiki/Inductively_coupled_plasma_mass_spectrometry)) mineral nutrient concentration measurements of cereal grain samples (in mg kg^-1^), including: Na, Mg, P, K, Ca, V, Cr, Mn, Fe, Co, Cu, Zn, Se, Mo. They also include the georeference of where the samples were collected (lon/lat, EPSG:3857), the cereal crop that was sampled at that location, and the source of the grain samples (i.e., from the standing crop, a field stack or a closely located grain store). They also include the gridded spatial variables we shall be using for predictive modeling (see their short description and raw data links in the table below).

<br>
```{r, echo = FALSE}
vars <- read.table("grids.csv", header = T, sep = ",")
datatable(vars)
```

<br>
You can download all of the data needed to run this notebook from our [GeoNutrition](https://osf.io/47t3u/) OSF repository using the following links to: [`MW_ET_grain_ICP.csv`](https://osf.io/cjfbs), [`ET_grids_250m.zip`](https://osf.io/827eb) and [`MW_grids_250m.zip`](https://osf.io/fvzh3). Place the 3 files into your working directory. Note that the `ET_grids_250.zip` download is relatively large (> 1 Gb), and may take a bit of time to transfer, depending on your connection. The next chunks load and merge the GeoNutrition survey data with the spatial (gridded) features.

```{r}
icpdat <- read.table("MW_ET_grain_ICP.csv", header=T, sep=",") ## ICP-MS data for ET & MW
et_icp <- icpdat[which(icpdat$survey == 'ET'), ]

# Unzip Ethiopia grid files into a sub-directory called "ET_grids"
dir.create("ET_grids", showWarnings=F)
unzip("ET_grids_250m.zip", exdir = "ET_grids", overwrite = T)

# Load rasters
et_lis <- list.files(path="./ET_grids", pattern="tif", full.names = T)
et_gri <- stack(et_lis)

# Ethiopia survey projection
et_pro <- as.data.frame(project(cbind(et_icp$lon, et_icp$lat), "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5
                                +units=m +no_defs"))

colnames(et_pro) <- c("x","y")
et_icp <- cbind(et_icp, et_pro)
coordinates(et_icp) <- ~x+y
projection(et_icp) <- projection(et_gri)

# Extract gridded variables at Ethiopia survey locations
et_grid <- extract(et_gri, et_icp)
et_icp <- as.data.frame(cbind(et_icp, et_grid))
```

Apply the same procedure to the Malawi GeoNutrition data.

```{r}
mw_icp <- icpdat[which(icpdat$survey == 'MW'), ]

# Unzip Malawi grid files into a sub-directory called "MW_grids"
dir.create("MW_grids", showWarnings=F)
unzip("MW_grids_250m.zip", exdir = "MW_grids", overwrite = T)

# Load rasters
mw_lis <- list.files(path="./MW_grids", pattern="tif", full.names = T)
mw_gri <- stack(mw_lis, quick = TRUE)

# Malawi survey projection
mw_pro <- as.data.frame(project(cbind(mw_icp$lon, mw_icp$lat), "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5
                                +units=m +no_defs"))

colnames(mw_pro) <- c("x","y")
mw_icp <- cbind(mw_icp, mw_pro)
coordinates(mw_icp) <- ~x+y
projection(mw_icp) <- projection(mw_gri)

# Extract gridded variables at Malawi survey locations
mw_grid <- extract(mw_gri, mw_icp)
mw_icp <- as.data.frame(cbind(mw_icp, mw_grid))
```

This chunk then merges and writes out a clean, combined data frame for Ethiopia and Malawi that can be reused should you wish to process it in software other than R. You can also download the file directly from our OSF site [here](https://osf.io/cjfbs).

```{r}
# Merge the Ethiopia and Malawi data
icpdat <- rbind(et_icp, mw_icp)

# Write-out main data frame
dir.create("Results", showWarnings=F)
write.csv(icpdat, "./Results/ET_MW_icpdat.csv", row.names = F)
glimpse(icpdat)
```

This next chunk shows where the grain samples were collected. You can zoom into the map to display the distribution of the individual survey locations. Clicking on individual markers will indicate which crop species were collected at that location. The red points are maize (the most frequent and designated reference crop in these surveys); blue dots indicate other crops.

```{r}
col <- ifelse(icpdat$crop == "maiz", "red", "blue")

# Plot sample locations
w <- leaflet() %>%
  setView(lng = mean(icpdat$lon), lat = mean(icpdat$lat), zoom = 4) %>%
  addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
  addCircleMarkers(icpdat$lon, icpdat$lat,
                   popup = icpdat$crop,
                   color = col,
                   stroke = FALSE,
                   fillOpacity = 0.8,
                   radius = 6,
                   clusterOptions = markerClusterOptions())

saveWidget(w, 'MW_ET_ICP_sample_locs.html', selfcontained = T) ## save widget
w ## plot widget
```

# Exploratory data analyses

## Compositional data transforms

*Compositional Data Analysis* ([Aitchison, 1986](http://www.leg.ufpr.br/lib/exe/fetch.php/pessoais%3Aabtmartins%3Aa_concise_guide_to_compositional_data_analysis.pdf)) refers to the analyses of *compositional data* (CoDa). CoDa are defined as vectors with strictly positive components whose sum is constant (e.g, 1 or 100%). The terms also cover all other *parts* of a whole (so-called *subcompositions*), which carry only relative information ([van den Boogaart and Tolosana-Delgado, 2013](https://link.springer.com/book/10.1007/978-3-642-36809-7)). Because a change in any proportion of a whole changes at least one other proportion, CoDa are interdependent and should be log ratio transformed (see e.g., [Parent et al., 2013](https://cdn.intechopen.com/pdfs/41378/InTech-Nutrient_balance_as_paradigm_of_plant_and_soil_chemometricsnutrient_balance_as_paradigm_of_soil_and_plant_chemometrics.pdf)). 

The next chunk [closes](https://econ-papers.upf.edu/papers/1554.pdf) the CoDa and calculates *centered log ratios* ([clr](https://www.rdocumentation.org/packages/compositions/versions/2.0-1/topics/clr)). These can be analyzed, summarized, and interpreted with conventional univariate, multivariate, data and machine learning modeling methods ([Greenacre, 2021](https://www.annualreviews.org/doi/pdf/10.1146/annurev-statistics-042720-124436)). Note that you could also use other log ratios such as the [alr](https://www.rdocumentation.org/packages/compositions/versions/2.0-1/topics/alr), or the [ilr](https://www.rdocumentation.org/packages/compositions/versions/2.0-1/topics/ilr), but we shall leave those possibilities for you to explore.

```{r}
# Calculate the clr
varc <- c("Na", "Mg", "P", "K", "Ca", "V", "Cr", "Mn", "Fe", "Co", "Cu", "Zn", "Se", "Mo")
comps <- icpdat[varc]
comps <- comps / rowSums(comps) ## close the composition
comps <- as.data.frame(clr(comps)) ## centered log ratio (clr) transform

# Attach survey data info
locv <- c("survey", "x", "y", "crop", "source", "albe", "bio1", "bio12", "bio15", "bio7", "bp", 
          "cec", "cp", "dows", "lstd", "lstn", "mdem", "mlat", "mlon", "para", "parv", "ph",
          "slope", "snd", "soc", "twi", "wp") 
locs <- icpdat[locv]
coda <- cbind(locs, comps)
coda <- na.omit(coda) ## includes only complete cases

# Scrub extraneous objects in memory
rm(list=setdiff(ls(), c("icpdat", "coda", "et_gri", "mw_gri")))
```

```{r, echo = FALSE, fig.align = "center", fig.show = 'hold', fig.height = 12, fig.cap = "**Fig. 3**: Boxplots of ionomic differences between cereal grains from Ethiopia and Malawi (on a centered log ratio scale)."}

# Reorder crop species
coda$crop <- factor(coda$crop, levels=c('maiz','sorg','pmil','fmil','teff','rice','whea','barl','trit'))

# Boxplots 
par(mfrow = c(5,3), pty = "s", mar=c(2,4,1,1))
boxplot(coda$Na ~ coda$crop, notch = T, ylab = "Na", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$Mg ~ coda$crop, notch = T, ylab = "Mg", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$P ~ coda$crop, notch = T, ylab = "P", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$K ~ coda$crop, notch = T, ylab = "K", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$Ca ~ coda$crop, notch = T, ylab = "Ca", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$V ~ coda$crop, notch = T, ylab = "V", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$Cr ~ coda$crop, notch = T, ylab = "Cr", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$Mn ~ coda$crop, notch = T, ylab = "Mn", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$Fe ~ coda$crop, notch = T, ylab = "Fe", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$Co ~ coda$crop, notch = T, ylab = "Co", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$Cu ~ coda$crop, notch = T, ylab = "Cu", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$Zn ~ coda$crop, notch = T, ylab = "Zn", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$Se ~ coda$crop, notch = T, ylab = "Se", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
boxplot(coda$Mo ~ coda$crop, notch = T, ylab = "Mo", xlab = "", cex.axis = 1, cex.lab = 1.3, las = 2)
```

## Ionomic profiles of cereal crops

We run a compositional main effects regression ([van den Boogaart and Tolosana-Delgado, 2013](https://link.springer.com/book/10.1007/978-3-642-36809-7)) using maize as the control crop. This quantifies the average ionomic differences between crop species (and their variability), which are shown in Fig. 3 and that can be used as a baseline for the spatial models that are presented in the subsequent sections. Section 4.5 augments this initially simple model with spatial predictions.

```{r}
icpdat$crop <- factor(icpdat$crop, levels=c('maiz','sorg','pmil','fmil','teff','rice','whea','barl','trit'))
y = acomp(icpdat[,10:23])
covars <- icpdat[,c("survey", "crop")]
survey <- covars$survey
crop <- covars$crop

# Ionomic differences between surveys and crop species
survey.lm <- lm(clr(y) ~ survey + crop)
survey.lm
```

# Spatial predictions

> *"The [**Rashomon effect**](https://en.wikipedia.org/wiki/Rashomon_effect) is a storytelling and writing method in cinema in which an event is given contradictory interpretations or descriptions by the individuals involved, thereby providing different perspectives and points of view of the same incident. The term, derived from the 1950 film Rashomon (directed by Akira Kurosawa), is used to describe the phenomenon of the unreliability of eyewitnesses."* -- also see: [Anderson (2016)](https://cjc.utpjournals.press/doi/10.22230/cjc.2016v41n2a3068).

We will be using a stacking workflow for spatial predictions. Stacking is an ensemble machine learning technique that combines multiple base models to achieve better overall performance in predictive modeling tasks. The primary aim of stacking is to improve the generalization capability of individual models by leveraging the strengths of diverse models to reduce prediction errors on unseen/test data ([Wolpert, 1992](http://machine-learning.martinsewell.com/ensembles/stacking/Wolpert1992.pdf)).

The core idea of stacking is to train multiple base models on the same training dataset and then to use their predictions as input features for a second-level model, called the stacked or meta-model. The stacked model is then trained to make the final prediction by combining the outputs of the base models ([Breiman, 1996](https://link.springer.com/content/pdf/10.1007/BF00058655.pdf). This two-level prediction process can also be extended to additional levels if needed. The basic stacking process is as follows:

1. Data splitting: The original training dataset is split into a training set and a validation set. This split can be done using techniques like k-fold cross-validation.

2. Base model training: Multiple base models are trained on the training set, which could include different algorithms, different configurations of the same algorithm, or feature data.

3. Base model predictions: The trained base models make predictions on the validation set, generating a new set of features for the stacked model.

4. Stacked model training: The ensemble model is trained on the new set of features generated by the base model predictions, learning to combine them optimally.

5. Final predictions: The ensemble makes its final predictions by feeding the test data through the base models and then through the stacked model.

The stacking process is especially effective when the base models have complementary strengths and weaknesses. By combining diverse models, the ensemble can exploit their individual strengths and mitigate their weaknesses, minimizing overfitting (or underfitting), resulting in improved overall performance on unseen data. This is very similar to guarding against bias and combining the predictions of a roomful of witnesses all of whom have very different recollections, opinions, and mental models about the same incident … hence, the “Rashomon Effect” analogy above.

## Training / testing setup

We initially set a 90/10% calibration and validation partition. The function `createDataPartition` in the [`caret`](https://topepo.github.io/caret/index.html) package can be used to generate balanced splits of the data. If the argument to this function is a factor, random sampling occurs within each class and will preserve the overall class balance of the data. We partition on the `survey` variable here. 

```{r}
# Set calibration/validation set randomization seed
seed <- 12358
set.seed(seed)
coda <- coda[sample(1:nrow(coda)), ] # randomize observations

# Split data into calibration and validation sets by survey
gnIndex <- createDataPartition(coda$survey, p = 9/10, list = F, times = 1)
gn_cal <- coda[ gnIndex,]
gn_val <- coda[-gnIndex,]

# Calibration labels
labs <- c("Ca") ## insert other labels here 
lcal <- as.vector(t(gn_cal[labs]))

# Raster calibration features
fcal <- gn_cal[ ,c(6:27)]
```

## Spatial base learner training and prediction

This chunk fits 2 base-learner models (... one bagging, one boosting), which uses gridded training data with 10-fold cross-validation. You can learn more about how these algorithms work by looking up the links at: [randomForest](https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest) and [gbm](https://www.rdocumentation.org/packages/gbm/versions/2.1.8.1). Change these as you see fit. There are about 160 different algorithms available via the `caret` package.

You can also use `caretEnsemble` package instead of `caret` as long as the feature variables (`et_gri` and `mw_gri` in this case), and the `trainControl` methods are the same for each model in the `caretList` function. This shortens the script-length of this notebook but does not otherwise affect the overall `caret` functionality. Note however that the calculations take a bit of time to run on a normal 8-core, 16 Gb memory computer. We fit these models with default-tuning of the relevant [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_machine_learning).

```{r, warning = FALSE, results='hide'}
# Start doParallel to parallelize model fitting
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Specify model training controls
tc <- trainControl(method="cv", number=10, allowParallel = TRUE, savePredictions="final")

# Fit 2 base-learners using the specified ionomic labels
blist <- caretList(fcal, lcal,
                   trControl = tc,
                   tuneList = NULL,
                   methodList = c("rf", "gbm"),
                   preProcess = c("center","scale"),
                   metric = "RMSE")

stopCluster(mc)
fname <- paste("./Results/", labs, "_blist.rds", sep = "")
saveRDS(blist, fname)

# Generate spatial base-learner predictions for Ethiopia
et_rf_pred <- predict(et_gri, blist$rf)
et_gb_pred <- predict(et_gri, blist$gbm)
et_spreds <- stack(et_rf_pred, et_gb_pred)
names(et_spreds) <- c("rf","gb")
fname <- paste("./Results/", labs, "_ET_base.tif", sep = "")
writeRaster(et_spreds, filename=fname, datatype="FLT4S", options="INTERLEAVE=BAND", overwrite=T)

# Generate spatial base-learner predictions for Malawi
mw_rf_pred <- predict(mw_gri, blist$rf)
mw_gb_pred <- predict(mw_gri, blist$gbm)
mw_spreds <- stack(mw_rf_pred, mw_gb_pred)
names(mw_spreds) <- c("rf","gb")
fname <- paste("./Results/", labs, "_MW_base.tif", sep = "")
writeRaster(mw_spreds, filename=fname, datatype="FLT4S", options="INTERLEAVE=BAND", overwrite=T)
```

## Ensemble learner training and prediction

This next chunk then fits and predicts linear ensemble regressions based on the initial two base-learner models, using the training data.

```{r, results = 'hide'}
# Extract base-learner predictions at the Ethiopia training locations
et_cal <- gn_cal[which(gn_cal$survey == 'ET'), ]
coordinates(et_cal) <- ~x+y
projection(et_cal) <- projection(et_spreds)
et_base <- extract(et_spreds, et_cal)
et_cal <- as.data.frame(cbind(et_cal, et_base))

# Extract base-learner predictions at the Malawi training locations
mw_cal <- gn_cal[which(gn_cal$survey == 'MW'), ]
coordinates(mw_cal) <- ~x+y
projection(mw_cal) <- projection(mw_spreds)
mw_base <- extract(mw_spreds, mw_cal)
mw_cal <- as.data.frame(cbind(mw_cal, mw_base))

# Row bind the Ethiopia and Malawi data frames
gn_cal <- rbind(et_cal, mw_cal)

# Raster calibration features (the base-learners)
lcal <- as.vector(t(gn_cal[labs]))
fcal <- gn_cal[ ,c(40:41)]

# Start doParallel
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Control setup
set.seed(1385321)
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")

# Model training
en <- train(fcal, lcal,
            method = "lm",
            metric = "RMSE",
            trControl = tc)

# Model predictions
et_en_pred <- predict(et_spreds, en) ## ET ensemble spatial predictions
mw_en_pred <- predict(mw_spreds, en) ## MW ensemble spatial predictions
stopCluster(mc)
fname <- paste("./Results/", labs, "_en.rds", sep = "")
saveRDS(en, fname)

# Write out the ensemble-learner prediction grids
et_fname <- paste("./Results/", labs, "_ET_ens.tif", sep = "")
writeRaster(et_en_pred, filename=et_fname, datatype="FLT4S", overwrite=T)
mw_fname <- paste("./Results/", labs, "_MW_ens.tif", sep = "")
writeRaster(mw_en_pred, filename=mw_fname, datatype="FLT4S", overwrite=T)
unlink("./Results/*.xml")
```

To save time we have pre-trained all 28 base and 14 ensemble learners we'll be using for the model testing and prediction rule generation steps below. This is a bit of hack! However, you can download and check all of the associated `.rds` model objects and `.gtif` files from our OSF repository [here](https://osf.io/f4p8g). Figures 4 & 5 below, show the stacked-learner predictions, after applying a few GIS cosmetics in [GRASS](https://grass.osgeo.org/). You can download, examine, and reuse the pre-trained model objects and predictions at https://osf.io/eyda2. Unzip and place those into your `./Results` sub-directory.

<br>
```{r, echo=FALSE, fig.align="center", fig.cap="**Fig. 4**: Compositional mineral nutrient predictions for Malawi (on a centered log ratio scale). Note that the numbers run from red (low), to yellow, to green (high) and have been histogram equalized to emphasize potentially anomalous spatial patterns.", out.width = '80%'}

knitr::include_graphics("MW_coda_maps.jpg")
```

<br>
```{r, echo=FALSE, fig.align="center", fig.cap="**Fig. 5**: Compositional mineral nutrient predictions for Ethiopia (on a centered log ratio scale). The color scheme is as in Fig. 4.", out.width = '80%'}

knitr::include_graphics("ET_coda_maps.jpg")
```

## Prediction uncertainties

The models that have been developed have not seen any of the test-set (evaluation) data up to this point. While there are many ways to quantify the uncertainty inherent in the ensemble predictions, we take a simple but quite robust approach here using quantile regression with ([quantreg](https://cran.r-project.org/web/packages/quantreg/quantreg.pdf)). We are mainly interested in the initial spread of the [ROI-wide](https://en.wikipedia.org/wiki/Region_of_interest) predictions (sensu, their 90% probable intervals). Once new data are collected, we shall be amending these and will also point out some additional (more spatially explicit) approaches. 

```{r, results = 'hide'}
# Stack Ethiopia prediction grids
et_lis <- list.files(path="./Results", pattern="ET_ens.tif", full.names = T)
et_sgrd <- stack(et_lis)
names(et_sgrd) <- c("sCa","sCo","sCr","sCu","sFe","sK","sMg","sMn","sMo","sNa","sP",
                   "sSe","sV","sZn")

# Extract ensemble predictions at the Ethiopia testing locations
et_val <- gn_val[which(gn_val$survey == 'ET'), ]
coordinates(et_val) <- ~x+y
projection(et_val) <- projection(et_sgrd)
et_ens <- extract(et_sgrd, et_val)
et_val <- as.data.frame(cbind(et_val, et_ens))

# Stack Malawi prediction grids
mw_lis <- list.files(path="./Results", pattern="MW_ens.tif", full.names = T)
mw_sgrd <- stack(mw_lis)
names(mw_sgrd) <- c("sCa","sCo","sCr","sCu","sFe","sK","sMg","sMn","sMo","sNa","sP",
                   "sSe","sV","sZn")

# Extract ensemble predictions at the Malawi testing locations
mw_val <- gn_val[which(gn_val$survey == 'MW'), ]
coordinates(mw_val) <- ~x+y
projection(mw_val) <- projection(mw_sgrd)
mw_ens <- extract(mw_sgrd, mw_val)
mw_val <- as.data.frame(cbind(mw_val, mw_ens))

# Row bind the Ethiopia and Malawi test data frames
gn_val <- rbind(et_val, mw_val)

# Quantile regression test-set fits
qNa <- rq(Na~sNa, tau=c(0.05,0.5,0.95), data = gn_val)
qMg <- rq(Mg~sMg, tau=c(0.05,0.5,0.95), data = gn_val)
qP <- rq(P~sP, tau=c(0.05,0.5,0.95), data = gn_val)
qK <- rq(K~sK, tau=c(0.05,0.5,0.95), data = gn_val)
qCa <- rq(Ca~sCa, tau=c(0.05,0.5,0.95), data = gn_val)
qV <- rq(V~sV, tau=c(0.05,0.5,0.95), data = gn_val)
qCr <- rq(Cr~sCr, tau=c(0.05,0.5,0.95), data = gn_val)
qMn <- rq(Mn~sMn, tau=c(0.05,0.5,0.95), data = gn_val)
qFe <- rq(Fe~sFe, tau=c(0.05,0.5,0.95), data = gn_val)
qCo <- rq(Co~sCo, tau=c(0.05,0.5,0.95), data = gn_val)
qCu <- rq(Cu~sCu, tau=c(0.05,0.5,0.95), data = gn_val)
qZn <- rq(Zn~sZn, tau=c(0.05,0.5,0.95), data = gn_val)
qSe <- rq(Se~sSe, tau=c(0.05,0.5,0.95), data = gn_val)
qMo <- rq(Mo~sMo, tau=c(0.05,0.5,0.95), data = gn_val)
```

```{r, echo = F, fig.align = "center", fig.show = 'hold', fig.height = 12, fig.cap = "**Fig. 6**: Quantile regression plots of predicted vs measured test-set values (on a centered log ratio scale). The red lines are estimated at the median, the blue lines are estimated at the 5% and 95% quantiles."}

colors <- c("dodger blue", "tomato", "dodger blue")
par(mfrow = c(5,3), pty = "s", mar=c(2,4,1,1))

plot(Na ~ sNa, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qNa$coefficients)) {
    abline(coef(qNa)[, j], col = colors[j], lwd = 2)}

plot(Mg ~ sMg, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qMg$coefficients)) {
    abline(coef(qMg)[, j], col = colors[j], lwd = 2)}

plot(P ~ sP, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qP$coefficients)) {
    abline(coef(qP)[, j], col = colors[j], lwd = 2)}

plot(K ~ sK, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qK$coefficients)) {
    abline(coef(qK)[, j], col = colors[j], lwd = 2)}

plot(Ca ~ sCa, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qCa$coefficients)) {
    abline(coef(qCa)[, j], col = colors[j], lwd = 2)}

plot(V ~ sV, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qV$coefficients)) {
    abline(coef(qV)[, j], col = colors[j], lwd = 2)}

plot(Cr ~ sCr, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qCr$coefficients)) {
    abline(coef(qCr)[, j], col = colors[j], lwd = 2)}

plot(Mn ~ sMn, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qMn$coefficients)) {
    abline(coef(qMn)[, j], col = colors[j], lwd = 2)}

plot(Fe ~ sFe, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qFe$coefficients)) {
    abline(coef(qFe)[, j], col = colors[j], lwd = 2)}

plot(Co ~ sCo, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qCo$coefficients)) {
    abline(coef(qCo)[, j], col = colors[j], lwd = 2)}

plot(Cu ~ sCu, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qCu$coefficients)) {
    abline(coef(qCu)[, j], col = colors[j], lwd = 2)}

plot(Zn ~ sZn, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qZn$coefficients)) {
    abline(coef(qZn)[, j], col = colors[j], lwd = 2)}

plot(Se ~ sSe, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qSe$coefficients)) {
    abline(coef(qSe)[, j], col = colors[j], lwd = 2)}

plot(Mo ~ sMo, data = gn_val, xlab="", col = "dark grey", cex.lab = 1.3)
for (j in 1:ncol(qMo$coefficients)) {
    abline(coef(qMo)[, j], col = colors[j], lwd = 2)}
```

## Ionomic prediction rules

These final chunks are intended to generate a foundation (at least from the computing side) for the development of a [recommender system](https://en.wikipedia.org/wiki/Recommender_system) that would allow users to make evidence/content based decisions about crop selections, biofortification and/or mineral nutrient supplementation practices in different environments. This is a spatially augmented version of the model that was presented in section 3.2. Also considered now are the dependencies between the individual ionomic components.  

```{r}
# Extract ensemble predictions at the Ethiopia survey locations
et_icp <- icpdat[which(icpdat$survey == 'ET'), ]
coordinates(et_icp) <- ~x+y
projection(et_icp) <- projection(et_sgrd)
et_ens <- extract(et_sgrd, et_icp)
et_icp <- as.data.frame(cbind(et_icp, et_ens))

# Extract ensemble predictions at the Malawi survey locations
mw_icp <- icpdat[which(icpdat$survey == 'MW'), ]
coordinates(mw_icp) <- ~x+y
projection(mw_icp) <- projection(mw_sgrd)
mw_ens <- extract(mw_sgrd, mw_icp)
mw_icp <- as.data.frame(cbind(mw_icp, mw_ens))

# Row bind the Ethiopia and Malawi icp data frames
icpdat <- rbind(et_icp, mw_icp)
write.csv(icpdat, "./Results/ET_MW_icpdat.csv", row.names = F)
```

```{r}
icpdat$crop <- factor(icpdat$crop, levels=c('maiz','sorg','pmil','fmil','teff','rice','whea','barl','trit'))
icpdat <- na.omit(icpdat) ## includes only complete cases
y = acomp(icpdat[,10:23])

# Select covariates
covars <- icpdat[,c("survey", "crop", "sNa", "sMg", "sP", "sK", "sCa", "sV", "sCr", "sMn", "sFe",
                    "sCo", "sCu", "sZn", "sSe", "sMo")]
survey <- covars$survey
crop <- covars$crop
sNa <- covars$sNa
sMg <- covars$sMg
sP <- covars$sP
sK <- covars$sK
sCa <- covars$sCa
sV <- covars$sV
sCr <- covars$sCr
sMn <- covars$sMn
sFe <- covars$sFe
sCo <- covars$sCo
sCu <- covars$sCu
sZn <- covars$sZn
sSe <- covars$sSe
sMo <- covars$sMo

# Fit compositional regression
rules.lm <- lm(clr(y) ~ survey + crop + sNa + sMg + sP + sK + sCa + sV + sCr + sMn + sFe + sCo
                        + sCu + sZn + sSe + sMo)
rules.lm
```

The next chunk transforms the model coefficients back to the original compositional scale. 

```{r}
coefs <- clrInv(coef(rules.lm), orig=y)
coefs
```

# Main takeaways

* The *GeoNutrition* prediction challenge shares features with other environmental risks to humans: (1) nutritional risks from consuming cereal staples are geographically dispersed but have strongly expressed spatial patterns; (2) readily available information exists to predict risky areas; (3) while it is possible to perform staple crop measurements to identify the risks at individual locations, it would be prohibitively expensive and logistically impossible to measure every location and rural household to apply specific mitigation measures such as staple cropping system changes, biofortification and/or supplementation. These need to be modeled, predicted, and monitored.

* "*Predictive accuracy is substantially improved when blending multiple predictors. Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique. Consequently, our solution is an ensemble of many methods.*" ([Bell, Koren and Volinsky, 2007](https://web.archive.org/web/20120304173001/http://www.netflixprize.com/assets/ProgressPrize2007_KorBell.pdf)). This successful strategy by the [Netflix prize](https://en.wikipedia.org/wiki/Netflix_Prize) winners applies here as well using "*out of the box*" feature transformation approaches and machine learning algorithms.

* With monitoring, these types of relatively simple prediction models could be deployed for e.g., estimating nutrient amounts per serving of a given crop and the % daily values similar to those shown on the supplement bottle in Fig. 2 ... but with the respective labels placed and updated on a national (GeoNutrition) map or information service. One additional thing that would be needed in this context would be accurate staple crop and livestock distribution data and maps ... see the notebook for Tanzania and Rwanda, which covers that topic [here](https://africasoils.info/wp-content/uploads/2022/09/Mixed_staples.html).

* Because the current data set is fairly small (by machine learning standards), there are caveats with regard to how well the current predictions generalize to below analytical detection limits, unseen, or unmeasured regions of interest (see e.g., Figs. 5 & 6). Our main intent for this notebook was to provide a computational example of how ionomic information might be gathered more cost-effectively in the future. However, it is also quite encouraging to be able to show that for two countries as environmentally different as Ethiopia and Malawi common, spatial, ionomic relationships and spatial patterns can be learned reliably using our proposed approach.

* Generating more field and laboratory data are not expected to be cost-prohibitive for either national governments or international donors and could be accomplished quickly. In fact, the models presented here should be further exposed to georeferenced validation data. We will provide more information about this and will also suggest "*GeoNutrition*" standard operating procedures in a subsequent notebook.

Any questions, comments or corrections related to this notebook are most welcome via [AFSIS](mailto:mgw.africasoils.info), and we will attempt to include them in any updates.