forked from BlasBenito/spatialRF
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
741 lines (578 loc) · 35 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
---
title: "`spatialRF`: Easy Spatial Regression with Random Forest"
output:
github_document:
toc: true
toc_depth: 2
pandoc_args: --webtex
always_allow_html: yes
---
<!-- badges: start -->
[![R-CMD-check](https://github.com/BlasBenito/spatialRF/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/BlasBenito/spatialRF/actions/workflows/R-CMD-check.yaml) [![Devel-version](https://img.shields.io/badge/devel%20version-1.0.6-blue.svg)](https://github.com/blasbenito/spatialRF) [![lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html) [![CRAN](https://img.shields.io/badge/CRAN-not_published_yet-red)](https://github.com/blasbenito/spatialRF) [![License](https://img.shields.io/badge/license-GPL--3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0.en.html)
<!-- badges: end -->
# Introduction
The package **spatialRF** facilitates fitting spatial regression models on regular or irregular data with Random Forest by generating *spatial predictors* that allow the model to take into account the spatial structure of the training data. The end goal is minimizing the spatial autocorrelation of the model residuals as much as possible.
Two main methods to generate *spatial predictors* from the distance matrix of the data points are implemented in the package:
- Moran's Eigenvector Maps [(Dray, Legendre, and Peres-Neto 2006)](https://www.sciencedirect.com/science/article/abs/pii/S0304380006000925).
- Distance matrix columns as explanatory variables [(Hengl et al. 2018)](https://peerj.com/articles/5518/).
The package is designed to minimize the amount of code required to fit a spatial model from a training dataset, the names of the response and the predictors, and a distance matrix, as the example below shows.
```{r, eval=FALSE}
spatial.model <- rf_spatial(
data = your_dataframe,
dependent.variable.name = "your_response_variable",
predictor.variable.names = c("predictor1", "predictor2", ..., "predictorN"),
distance.matrix = your_distance_matrix
)
```
The package, that uses the `ranger` package under the hood [(Wright and Ziegler 2017)](https://arxiv.org/abs/1508.04409), also provides tools to identify potentially interesting variable interactions, tune random forest hyperparameters, assess model performance on spatially independent data folds, and examine the resulting models via importance plots, response curves, and response surfaces.
# Development
This package is reaching its final form, and big changes are not expected at this stage. However, it has many functions, and even though all them have been tested, only one dataset has been used for those tests. You will find bugs, and something will go wrong almost surely. If you have time to report bugs, please, do so in any of the following ways:
+ Open a new issue in the [Issues GitHub page of the package](https://github.com/BlasBenito/spatialRF/issues).
+ Send me an email explaining the issue and the error messages with enough detail at blasbenito at gmail dot com.
+ Send a direct message to [my twitter account](https://twitter.com/blasbenito) explaining the issue.
I will do my best to solve any issues ASAP!
# Applications
The goal of `spatialRF` is to help fitting *explanatory spatial regression*, where the target is to understand how a set of predictors and the spatial structure of the data influences response variable. Therefore, the spatial analyses implemented in the package can be applied to any spatial dataset, regular or irregular, with a sample size between \~100 and \~5000 cases (the higher end will depend on the RAM memory available), a quantitative or binary (values 0 and 1) response variable, and a more or less large set of predictive variables.
All functions but `rf_spatial()` work with non-spatial data as well if the arguments `distance.matrix` and `distance.thresholds` are ignored. In such case, the number of cases is no longer limited by the size of the distance matrix, and models can be trained with hundreds of thousands of rows.
However, **when the focus is on fitting spatial models**, and due to the nature of the *spatial predictors* used to represent the spatial structure of the training data, **there are many things this package cannot do**:
- Predict model results over raster data.
- Predict a model result over another region with a different spatial structure.
- Work with "big data", whatever that means.
- Imputation or extrapolation (it can be done, but models based on spatial predictors are hardly transferable).
- Take temporal autocorrelation into account (but this is something that might be implemented later on).
If after considering these limitations you are still interested, follow me, I will show you how it works.
# Install
The package is not yet in the CRAN repositories, so at the moment it must be installed from GitHub as follows.
```{r, message = FALSE}
remotes::install_github(
repo = "blasbenito/spatialRF",
ref = "main",
force = TRUE,
quiet = TRUE
)
library(spatialRF)
```
There are a few other libraries that will be useful during this tutorial.
```{r, message = FALSE}
library(kableExtra)
library(rnaturalearth)
library(rnaturalearthdata)
```
# Data requirements
The data required to fit random forest models with `spatialRF` must fulfill several conditions:
+ **The input format is data.frame**. At the moment, tibbles are not fully supported.
+ **The number of rows must be somewhere between 100 and ~5000**, but that will depend on the RAM available in your system. However, this limitation only affects spatial analyses performed with `rf_spatial()`, while all other modeling and plotting functions should work without a distance matrix (if they don't tell me, that'd be a bug!), and therefore analyses in large datasets can still be done with the package.
+ **The number of predictors should be larger than 3**, fitting a Random Forest model is moot otherwise.
+ **Factors in the response or the predictors are not explicitly supported in the package**, they may work, or they won't, but in any case, I designed this package for quantitative data alone.
+ **Must be free of `NA`**. You can check if there are NA records with `sum(apply(df, 2, is.na))`. If the result is larger than 0, then just execute `df <- na.omit(df)` to remove rows with empty cells.
+ **Columns cannot have zero variance**. This condition can be checked with `apply(df, 2, var) == 0`. Columns yielding TRUE should be removed.
+ **Columns must not yield `NaN` or `Inf` when scaled**. You can check each condition with `sum(apply(scale(df), 2, is.nan))` and `sum(apply(scale(df), 2, is.infinite))`. If higher than 0, you can find what columns are giving issues with `sapply(as.data.frame(scale(df)), function(x)any(is.nan(x)))` and `sapply(as.data.frame(scale(df)), function(x)any(is.infinite(x)))`. Any column yielding `TRUE` will generate issues while trying to fit models with `spatialRF`.
# Example data
The package includes an example dataset that fulfills the conditions mentioned above, named [`plant_richness_df`](https://blasbenito.github.io/spatialRF/reference/plant_richness_df.html). It is a data frame with plant species richness and predictors for 227 ecoregions in the Americas, and a distance matrix among the ecoregion edges named, well, [`distance_matrix`](https://blasbenito.github.io/spatialRF/reference/distance_matrix.html).
```{r}
data(plant_richness_df)
data(distance_matrix)
#names of the response variable and the predictors
dependent.variable.name <- "richness_species_vascular"
predictor.variable.names <- colnames(plant_richness_df)[5:21]
```
The response variable of `plant_richness_df` is "richness_species_vascular", with the total count of vascular plant species found on each ecoregion. The figure below shows the centroids of each ecoregion along with their associated value of the response variable.
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.width=6, fig.height=5.5}
world <- rnaturalearth::ne_countries(
scale = "medium",
returnclass = "sf"
)
ggplot2::ggplot() +
ggplot2::geom_sf(data = world, fill = "white") +
ggplot2::geom_point(
data = plant_richness_df,
ggplot2::aes(
x = x,
y = y,
color = richness_species_vascular
),
size = 2.5
) +
ggplot2::scale_color_viridis_c(direction = -1) +
ggplot2::theme_bw() +
ggplot2::labs(color = "Plant richness") +
ggplot2::scale_x_continuous(limits = c(-170, -30)) +
ggplot2::scale_y_continuous(limits = c(-58, 80)) +
ggplot2::ggtitle("Plant richness of the American ecoregions") +
ggplot2::xlab("Longitude") +
ggplot2::ylab("Latitude")
```
The predictors (columns 5 to 21) represent diverse factors that may influence plant richness such as sampling bias, the area of the ecoregion, climatic variables, human presence and impact, topography, geographical fragmentation, and features of the neighbors of each ecoregion. The figure below shows the scatterplots of the response variable (y axis) against each predictor (x axis).
```{r, echo = FALSE, fig.width=10, fig.height=14}
plot.list <- list()
for(variable in predictor.variable.names){
plot.list[[variable]] <- ggplot2::ggplot(
data = plant_richness_df,
ggplot2::aes_string(
x = variable,
y = "richness_species_vascular",
color = "richness_species_vascular"
)
) +
ggplot2::geom_point() +
ggplot2::scale_color_viridis_c(direction = -1) +
ggplot2::theme_bw() +
ggplot2::theme(legend.position = "none")
}
patchwork::wrap_plots(plot.list, ncol = 3)
```
# Finding promising variable interactions
Random Forests already takes into account variable interactions of the form "variable `a` becomes important when `b` is higher than x". However, Random Forest can also take advantage of variable interactions of the form `a * b`, as they are commonly defined in regression models.
The function [`rf_interactions()`](https://blasbenito.github.io/spatialRF/reference/rf_interactions.html) tests all possible interactions among predictors by using each one of them in a separate model, and suggesting the ones with the higher potential contribution to the model's R squared and the higher relative importance (presented as a percentage of the maximum importance of a variable in the model).
```{r, fig.width=8, fig.height=4}
interactions <- rf_interactions(
data = plant_richness_df,
dependent.variable.name = dependent.variable.name,
predictor.variable.names = predictor.variable.names
)
```
Here `rf_interactions()` suggests several candidate interactions ordered by their impact on the model. The function cannot say whether an interaction *makes sense*, and it is up to the user to choose wisely whether to select an interaction or not.
For the sake of the example, I will choose `climate_bio1_average_X_bias_area_km2`, hypothesizing that ecoregions with higher area (bias_area_km2) and energy (represented by the annual temperature, climate_bio1_average) will have more species of vascular plants (this is just an example, many other rationales are possible when choosing between candidate interactions). The data required to add the interaction to the training data is in the output of `rf_interactions()`.
```{r}
#adding interaction column to the training data
plant_richness_df[, "climate_bio1_average_X_bias_area_km2"] <- interactions$columns[, "climate_bio1_average_X_bias_area_km2"]
#adding interaction name to predictor.variable.names
predictor.variable.names <- c(predictor.variable.names, "climate_bio1_average_X_bias_area_km2")
```
# Reducing multicollinearity in the predictors
The functions [`auto_cor()`](https://blasbenito.github.io/spatialRF/reference/auto_cor.html) and [`auto_vif()`](https://blasbenito.github.io/spatialRF/reference/auto_vif.html) help reduce redundancy in the predictors by using different criteria (bivariate R squared vs. [variance inflation factor](https://www.statisticshowto.com/variance-inflation-factor/)), while allowing the user to define an *order of preference*, which can be based either on domain expertise or on a quantitative assessment. The preference order is defined as a character vector in the `preference.order` argument of both functions, and does not need to include the names of all predictors, but just the ones the user would like to keep in the analysis.
In the example below I give preference to the interaction suggested by `rf_interactions()` over it's two components, and prioritize climate over other types of predictors (any other choice would be valid, it just depends on the scope of the study). These rules are applied to both `auto_cor()` and `auto_vif()`, that are executed sequentially by using the `%>%` pipe from the [magrittr](https://magrittr.tidyverse.org/) package.
Notice that I have set `cor.threshold` and `vif.threshold` to low values because the predictors in `plant_richness_df` already have little multicollinearity,. The default values (`cor.threshold = 0.75` and `vif.threshold = 5`) should work well when combined together for any other set of predictors.
```{r}
preference.order <- c(
"climate_bio1_average_X_bias_area_km2",
"climate_aridity_index_average",
"climate_hypervolume",
"climate_bio1_average",
"climate_bio15_minimum",
"bias_area_km2"
)
predictor.variable.names <- auto_cor(
x = plant_richness_df[, predictor.variable.names],
cor.threshold = 0.6,
preference.order = preference.order
) %>%
auto_vif(
vif.threshold = 2.5,
preference.order = preference.order
)
```
The output of `auto_cor()` or `auto_vif()` is of the class "variable_selection", that can be used as input for the argument `predictor.variable.names` of any modeling function within the package. An example is shown in the next section.
# Fitting a non-spatial Random Forest model
To fit basic Random Forest models `spatialRF` provides the [`rf()`](https://blasbenito.github.io/spatialRF/reference/rf.html) function. It takes the training data, the names of the response and the predictors, and optionally (to assess the spatial autocorrelation of the residuals), the distance matrix, and a vector of distance thresholds (in the same units as the distances in **distance_matrix**).
These distance thresholds are the neighborhoods at which the model will check the spatial autocorrelation of the residuals. Their values may depend on the spatial scale of the data, and the ecological system under study.
Notice that here I plug the object `predictor.variable.names`, output of `auto_cor()` and `auto_vif()`, directly into the `predictor.variable.names` argument.
```{r}
model.non.spatial <- rf(
data = plant_richness_df,
dependent.variable.name = dependent.variable.name,
predictor.variable.names = predictor.variable.names,
distance.matrix = distance_matrix,
distance.thresholds = c(0, 1500, 3000),
seed = 100, #for reproducibility
verbose = FALSE
)
```
The model output can be printed or plotted with a plethora of functions such as [`print()`](https://blasbenito.github.io/spatialRF/reference/print.html), [`print_importance()`](https://blasbenito.github.io/spatialRF/reference/print_importance.html), [`print_performance()`](https://blasbenito.github.io/spatialRF/reference/print_performance.html), [`plot_importance()`](https://blasbenito.github.io/spatialRF/reference/plot_importance.html), [`print_moran()`](https://blasbenito.github.io/spatialRF/reference/print_moran.html), [`plot_moran()`](https://blasbenito.github.io/spatialRF/reference/plot_moran.html), [`plot_response_curves()`](https://blasbenito.github.io/spatialRF/reference/plot_response_curves.html), or [`plot_response_surfaces)`](https://blasbenito.github.io/spatialRF/reference/plot_response_surfaces.html), among many others.
```{r, fig.width = 12, fig.height=7.5}
plot_response_curves(model.non.spatial)
```
In the response curves above, the other predictors are set to their quantiles 0.1, 0.5, and 0.8, but the user can change this behavior by modifying the values of the `quantiles` argument.
```{r, fig.height = 3.2, fig.width=4.5}
plot_response_surfaces(
x = model.non.spatial,
a = "climate_bio1_average",
b = "neighbors_count"
)
```
In this response surface, the predictors that are not shown are set to their medians (but other quantiles are possible).
```{r, fig.width = 6, fig.height=4.5}
plot_importance(model.non.spatial, verbose = FALSE)
```
**Predicting onto new data**
Models fitted with `rf()` and other `rf_X()` functions within the package can be predicted onto new data just as it is done with `ranger()` models:
```{r}
predicted <- stats::predict(
object = model.non.spatial,
data = plant_richness_df,
type = "response"
)$predictions
```
**Repeating a model execution**
Random Forest is an stochastic algorithm that yields slightly different results on each run unless a random seed is set. This particularity has implications for the interpretation of variable importance scores. For example, in the plot above, the difference in importance between the predictors `climate_hypervolume` and `climate_bio1_average_X_bias_area_km2` could be just the result of chance. The function [`rf_repeat()`](https://blasbenito.github.io/spatialRF/reference/rf_repeat.html) repeats a model execution and yields the distribution of importance scores of the predictors across executions.
```{r, fig.width = 6, fig.height=4.5}
model.non.spatial.repeat <- rf_repeat(
model = model.non.spatial,
repetitions = 30,
verbose = FALSE
)
plot_importance(model.non.spatial.repeat, verbose = FALSE)
```
After 30 model repetitions it is clear that the difference in importance between `climate_hypervolume` and `climate_bio1_average_X_bias_area_km2` is not the result of chance.
The response curves of models fitted with `rf_repeat()` can be plotted with `plot_response_curves()` as well. The median prediction is shown with a thicker line.
```{r, fig.width = 12, fig.height=7.5}
plot_response_curves(model.non.spatial.repeat, quantiles = 0.5)
```
The function `get_response_curves()` returns a data frame with the data required to make custom plots of the response curves.
```{r}
plot.df <- get_response_curves(model.non.spatial.repeat)
```
```{r, echo = FALSE, warning = FALSE, message=FALSE}
kableExtra::kbl(
head(plot.df, n = 10),
format = "markdown"
) %>%
kableExtra::kable_paper("hover", full_width = F)
```
# Tuning Random Forest hyperparameters
The model fitted above was based on the default hyperparameter values provided by `ranger()`, and those might not be the most adequate ones for a given dataset. The function [`rf_tuning()`](https://blasbenito.github.io/spatialRF/reference/rf_tuning.html) helps the user to choose sensible values for three Random Forest hyperparameters that are critical to model performance:
- `num.trees`: number of regression trees in the forest.
- `mtry`: number of variables to choose from on each tree split.
- `min.node.size`: minimum number of cases on a terminal node.
Model tuning can be done on out-of-bag (`method = "oob"`) or spatial cross-validation (`method = "spatial.cv"`) R squared values. The example below shows the out-of-bag approach because I will explain spatial cross-validation with `rf_evaluate()` later in this document.
```{r, fig.width=6, fig.height=4.5}
model.non.spatial.tuned <- rf_tuning(
model = model.non.spatial,
method = "spatial.cv",
xy = plant_richness_df[, c("x", "y")],
repetitions = 30,
num.trees = c(500, 1000),
mtry = seq(
2,
14, #equal or lower than the number of predictors
by = 3
),
min.node.size = c(5, 10, 20)
)
```
The function `rf_tuning()` returns a model fitted with the same data as the original model, but using the best hyperparameters found during tuning. Model tuning has helped to a very small improvement in performance measures (+ `r round(model.non.spatial.tuned$tuning$r.squared.gain, 3)` R squared), so from here, we can keep working with `model.non.spatial.tuned`.
# Fitting a spatial model
The spatial autocorrelation of the residuals of `model.non.spatial`, measured with [Moran's I](https://en.wikipedia.org/wiki/Moran%27s_I), can be plotted with [`plot_moran()`](https://blasbenito.github.io/spatialRF/reference/plot_moran.html).
```{r, fig.width=6, fig.height=3, message=FALSE, warning=FALSE}
plot_moran(model.non.spatial.tuned, verbose = FALSE)
```
According to the plot, the spatial autocorrelation of the residuals is highly positive for a neighborhood of 0 km, while it becomes non-significant (p-value \> 0.05, whatever that means) at 1500 and 3000 km. To reduce the spatial autocorrelation of the residuals as much as possible, the non-spatial tuned model fitted above can be converted into a spatial model easily with [`rf_spatial()`](https://blasbenito.github.io/spatialRF/reference/rf_spatial.html), that by default uses the Moran's Eigenvector Maps method.
```{r}
model.spatial <- rf_spatial(
model = model.non.spatial.tuned,
method = "mem.moran.sequential", #default method
verbose = FALSE,
seed = 100
)
```
The plot below shows the Moran's I of the residuals of the spatial model. It shows that `rf_spatial()` has managed to remove the spatial autocorrelation (p-values of the Moran's I estimates for each neighborhood distance are higher than 0.05) of the model residuals for every neighborhood distance.
```{r, fig.width=6, fig.height=3, message=FALSE, warning=FALSE}
plot_moran(model.spatial, verbose = FALSE)
```
If we compare the variable importance plots of both models, we can see that the spatial model has an additional set of dots under the name "spatial_predictors", and that the maximum importance of a few of these spatial predictors matches the importance of the most relevant non-spatial predictors.
```{r, fig.width=10, fig.height=4.5}
p1 <- plot_importance(
model.non.spatial,
verbose = FALSE) +
ggplot2::ggtitle("Non-spatial model")
p2 <- plot_importance(
model.spatial,
verbose = FALSE) +
ggplot2::ggtitle("Spatial model")
p1 | p2
```
If we take a look to the ten most important variables in `model.spatial` we will see that a few of them are spatial predictors.
```{r, echo = FALSE, warning = FALSE, message=FALSE}
kableExtra::kbl(
head(model.spatial$variable.importance$per.variable, n = 10),
format = "markdown"
) %>%
kableExtra::kable_paper("hover", full_width = F)
```
Spatial predictors are named `spatial_predictor_X_Y`, where `X` is the neighborhood distance at which the predictor has been generated, and `Y` is the index of the predictor.
Spatial predictors, as shown below, are smooth surfaces representing neighborhood among records at different spatial scales.
```{r, echo=FALSE, fig.width=12, fig.height=7}
spatial.predictors <- get_spatial_predictors(model.spatial)
pr <- data.frame(spatial.predictors, plant_richness_df[, c("x", "y")])
p1 <- ggplot2::ggplot() +
ggplot2::geom_sf(data = world, fill = "white") +
ggplot2::geom_point(
data = pr,
ggplot2::aes(
x = x,
y = y,
color = spatial_predictor_0_2
),
size = 2.5
) +
ggplot2::scale_color_viridis_c(direction = -1) +
ggplot2::theme_bw() +
ggplot2::labs(color = "Eigenvalue") +
ggplot2::scale_x_continuous(limits = c(-170, -30)) +
ggplot2::scale_y_continuous(limits = c(-58, 80)) +
ggplot2::ggtitle("Variable: spatial_predictor_0_2") +
ggplot2::theme(legend.position = "bottom")+
ggplot2::xlab("Longitude") +
ggplot2::ylab("Latitude")
p2 <- ggplot2::ggplot() +
ggplot2::geom_sf(data = world, fill = "white") +
ggplot2::geom_point(
data = pr,
ggplot2::aes(
x = x,
y = y,
color = spatial_predictor_3000_1,
),
size = 2.5
) +
ggplot2::scale_color_viridis_c(direction = -1) +
ggplot2::theme_bw() +
ggplot2::labs(color = "Eigenvalue") +
ggplot2::scale_x_continuous(limits = c(-170, -30)) +
ggplot2::scale_y_continuous(limits = c(-58, 80)) +
ggplot2::ggtitle("Variable: spatial_predictor_3000_1") +
ggplot2::theme(legend.position = "bottom") +
ggplot2::xlab("Longitude") +
ggplot2::ylab("")
p1 | p2
```
The spatial predictors in the spatial model have been generated using the method "mem.moran.sequential" (function's default), that mimics the Moran's Eigenvector Maps method described in [(Dray, Legendre, and Peres-Neto 2006)](https://www.sciencedirect.com/science/article/abs/pii/S0304380006000925).
In brief, the method consist on transforming the distance matrix into a double-centered matrix of normalized weights, to then compute the positive eigenvectors of the weights matrix (a.k.a, Moran's Eigenvector Maps, or MEMs).
The MEMs are included in the model one by one in the order of their Moran's I, and the subset of MEMs maximizing the model's R squared and minimizing the Moran's I of the residuals and the number of MEMs added to the model are selected, as shown in the optimization plot below (dots linked by lines represent the selected spatial predictors). The selection procedure is performed by the function [`select_spatial_predictors_sequential()`](https://blasbenito.github.io/spatialRF/reference/select_spatial_predictors_sequential.html).
```{r, echo=FALSE, fig.width=6, fig.height=4}
p <- plot_optimization(model.spatial)
```
**Tuning spatial models**
Spatial models fitted with `rf_spatial()` can be tuned as well with `rf_tuning()`. However, tuning may in some cases increase the spatial autocorrelation of the model residuals. In that case, the function will return a message explaining the situation, and the original model without any sort of tuning applied
```{r, fig.width=6, fig.height=4.5}
model.spatial.tuned <- rf_tuning(
model = model.spatial,
method = "spatial.cv",
xy = plant_richness_df[, c("x", "y")],
repetitions = 30,
num.trees = c(500, 1000),
mtry = seq(
2,
length(model.spatial$ranger.arguments$predictor.variable.names),
by = 9),
min.node.size = c(5, 20)
)
```
# Assessing model performance on spatially independent folds
The function [`rf_evaluate()`](https://blasbenito.github.io/spatialRF/reference/rf_evaluate.html) separates the training data into a number of spatially independent training and testing folds, fits a model on each training fold, predicts over each testing fold, and computes performance measures, to finally aggregate them across model repetitions. Let's see how it works.
```{r}
model.spatial.tuned <- rf_evaluate(
model = model.spatial.tuned,
xy = plant_richness_df[, c("x", "y")], #data coordinates
repetitions = 30, #number of folds
training.fraction = 0.8, #training data fraction
metrics = c("r.squared", "rmse"),
verbose = FALSE
)
```
The function generates a new slot in the model named "evaluation" with several objects that summarize the spatial cross-validation results.
```{r}
names(model.spatial.tuned$evaluation)
```
The slot "spatial.folds", produced by [`make_spatial_folds()`](https://blasbenito.github.io/spatialRF/reference/make_spatial_folds.html), contains the indices of the training and testing cases for each cross-validation repetition. The maps below show two sets of training and testing spatial folds.
```{r, echo=FALSE, fig.width=10, fig.height=5}
pr <- plant_richness_df[, c("x", "y")]
pr$group.2 <- pr$group.1 <- "Training"
pr[model.spatial.tuned$evaluation$spatial.folds[[1]]$testing, "group.1"] <- "Testing"
pr[model.spatial.tuned$evaluation$spatial.folds[[25]]$testing, "group.2"] <- "Testing"
p1 <- ggplot2::ggplot() +
ggplot2::geom_sf(data = world, fill = "white") +
ggplot2::geom_point(data = pr,
ggplot2::aes(
x = x,
y = y,
color = group.1
),
size = 2.5
) +
ggplot2::scale_color_viridis_d(direction = -1, end = 0.8, alpha = 0.6) +
ggplot2::theme_bw() +
ggplot2::labs(color = "Group") +
ggplot2::scale_x_continuous(limits = c(-170, -30)) +
ggplot2::scale_y_continuous(limits = c(-58, 80)) +
ggplot2::ggtitle("Spatial fold 1") +
ggplot2::theme(legend.position = "none") +
ggplot2::xlab("Longitude") +
ggplot2::ylab("Latitude")
p2 <- ggplot2::ggplot() +
ggplot2::geom_sf(data = world, fill = "white") +
ggplot2::geom_point(data = pr,
ggplot2::aes(
x = x,
y = y,
color = group.2
),
size = 2.5
) +
ggplot2::scale_color_viridis_d(direction = -1, end = 0.8, alpha = 0.6) +
ggplot2::theme_bw() +
ggplot2::labs(color = "Group") +
ggplot2::scale_x_continuous(limits = c(-170, -30)) +
ggplot2::scale_y_continuous(limits = c(-58, 80)) +
ggplot2::ggtitle("Spatial fold 25") +
ggplot2::xlab("Longitude") +
ggplot2::ylab("")
p1 | p2
```
The functions [`plot_evaluation()`](https://blasbenito.github.io/spatialRF/reference/plot_evaluation.html) and [`print_evaluation()`](https://blasbenito.github.io/spatialRF/reference/print_evaluation.html) allow to check the evaluation results as a plot or as a table.
```{r, fig.width=4, fig.height=3.5}
print_evaluation(model.spatial.tuned)
```
The low R squared yielded by the model evaluation shows that the spatial model is hard to transfer outside of the training space. Models based on a spatial structure like the ones fitted with `rf_spatial()` do not work well when transferred to a different place (that is what `rf_compare()` does), because spatial structures are not transferable when the data is irregularly distributed, as it is the case with `plant_richness_df`. The comparison below shows how non-spatial models may show better (not bad, not great) evaluation scores on independent spatial folds.
# Comparing several models
The function `rf_evaluate()` only assesses the predictive performance on unseen data of one model at a time. If the goal is to compare two models, `rf_evaluate()` can be indeed ran twice, but `spatialRF` offers a more convenient option named [`rf_compare()`](https://blasbenito.github.io/spatialRF/reference/rf_compare.html). It takes as input a named list with as many models as the user needs to compare.
```{r, fig.width=6, fig.height=5}
comparison <- rf_compare(
models = list(
`Non-spatial` = model.non.spatial,
`Non-spatial tuned` = model.non.spatial.tuned,
`Spatial` = model.spatial,
`Spatial tuned` = model.spatial.tuned
),
xy = plant_richness_df[, c("x", "y")],
repetitions = 30,
training.fraction = 0.8,
metrics = c("r.squared", "rmse"),
notch = TRUE
)
```
```{r, echo = FALSE, message=FALSE, warning=FALSE}
x <- comparison$comparison.df %>%
dplyr::group_by(model, metric) %>%
dplyr::summarise(value = round(median(value), 3)) %>%
dplyr::arrange(metric) %>%
as.data.frame()
colnames(x) <- c("Model", "Metric", "Mean")
kableExtra::kbl(
x,
format = "markdown"
) %>%
kableExtra::kable_paper("hover", full_width = F)
```
# Generating spatial predictors for other models
You might not love Random Forest, but `spatialRF` loves you, and as such, it gives you tools to generate spatial predictors for other models anyway.
The first step requires generating Moran's Eigenvector Maps (MEMs) from the distance matrix. Here there are two options, computing MEMs for a single neighborhood distance with [`mem()`](https://blasbenito.github.io/spatialRF/reference/mem.html), and computing MEMs for several neighborhood distances at once with [`mem_multithreshold()`](https://blasbenito.github.io/spatialRF/reference/mem_multithreshold.html).
```{r}
#single distance (0km by default)
mems <- mem(x = distance_matrix)
#several distances
mems <- mem_multithreshold(
x = distance_matrix,
distance.thresholds = c(0, 1000, 2000)
)
```
In either case the result is a data frame with Moran's Eigenvector Maps ("just" the positive eigenvectors of the double-centered distance matrix).
```{r, echo = FALSE, warning = FALSE, message=FALSE}
kableExtra::kbl(
head(mems[, 1:4], n = 10),
format = "markdown"
) %>%
kableExtra::kable_paper("hover", full_width = F)
```
But not all MEMs are made equal, and you will need to rank them by their Moran's I. The function [`rank_spatial_predictors()`](https://blasbenito.github.io/spatialRF/reference/rank_spatial_predictors.html) will help you do so.
```{r}
mem.rank <- rank_spatial_predictors(
distance.matrix = distance_matrix,
spatial.predictors.df = mems
)
```
The output of `rank_spatial_predictors()` is a list with three slots: "method", a character string with the name of the ranking method; "criteria", an ordered data frame with the criteria used to rank the spatial predictors; and "ranking", a character vector with the names of the spatial predictors in the order of their ranking (it is just the first column of the "criteria" data frame). We can use this "ranking" object to reorder or `mems` data frame.
```{r}
mems <- mems[, mem.rank$ranking]
#also:
#mems <- mem.rank$spatial.predictors.df
```
From here, spatial predictors can be included in any model one by one, in the order of the ranking, until the spatial autocorrelation of the residuals is gone, or our model gets totally defaced. A little example with a linear model follows.
```{r}
#model definition
predictors <- c(
"climate_aridity_index_average ",
"climate_bio1_average",
"bias_species_per_record",
"human_population_density",
"topography_elevation_average",
"fragmentation_division"
)
model.formula <- as.formula(
paste(
dependent.variable.name,
" ~ ",
paste(
predictors,
collapse = " + "
)
)
)
#scaling the data
model.data <- scale(plant_richness_df) %>%
as.data.frame()
#fitting the model
m <- lm(model.formula, data = plant_richness_df)
#Moran's I test of the residuals
moran.test <- moran(
x = residuals(m),
distance.matrix = distance_matrix,
)
moran.test
```
According to the Moran's I test, the model residuals show spatial autocorrelation. Let's introduce MEMs one by one until the problem is solved.
```{r}
#add mems to the data and applies scale()
model.data <- data.frame(
plant_richness_df,
mems
) %>%
scale() %>%
as.data.frame()
#initialize predictors.i
predictors.i <- predictors
#iterating through MEMs
for(mem.i in colnames(mems)){
#add mem name to model definintion
predictors.i <- c(predictors.i, mem.i)
#generate model formula with the new spatial predictor
model.formula.i <- as.formula(
paste(
dependent.variable.name,
" ~ ",
paste(
predictors.i,
collapse = " + "
)
)
)
#fit model
m.i <- lm(model.formula.i, data = model.data)
#Moran's I test
moran.test.i <- moran(
x = residuals(m.i),
distance.matrix = distance_matrix,
)
#stop if no autocorrelation
if(moran.test.i$interpretation != "Positive spatial correlation"){
break
}
}#end of loop
```
Now we can compare the model without spatial predictors `m` and the model with spatial predictors `m.i`.
```{r, echo = FALSE, warning = FALSE, message=FALSE}
comparison.df <- data.frame(
Model = c("Non-spatial", "Spatial"),
Predictors = c(length(predictors), length(predictors.i)),
R_squared = round(c(summary(m)$r.squared, summary(m.i)$r.squared), 2),
AIC = round(c(AIC(m), AIC(m.i)), 0),
BIC = round(c(BIC(m), BIC(m.i)), 0),
`Moran I` = round(c(moran.test$moran.i, moran.test.i$moran.i), 2)
)
kableExtra::kbl(
comparison.df,
format = "markdown"
) %>%
kableExtra::kable_paper("hover", full_width = F)
```
According to the model comparison, it can be concluded that the addition of spatial predictors, in spite of the increase in complexity, has improved the model. In any case, this is just a simple demonstration of how spatial predictors generated with functions of the `spatialRF` package can still help you fit spatial models with other modeling methods.