forked from JuliaEarth/geospatial-data-science-with-julia
-
Notifications
You must be signed in to change notification settings - Fork 0
/
10-correlation.qmd
510 lines (375 loc) · 15.5 KB
/
10-correlation.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
# Geospatial correlation
```{julia}
#| echo: false
#| output: false
import Pkg
Pkg.activate(".")
using Random
using GeoStats
import CairoMakie as Mke
```
In **Part II** and **Part III** of the book, we learned two important
tools for *efficient* geospatial data science. We learned how **transform
pipelines** can be used to prepare geospatial data for investigation, and
how **geospatial queries** can be used to answer geoscientific questions.
Before we can learn our third tool, we need to review the important concept
of **geospatial correlation**:
::: {.callout-tip}
## Definition
Given $n$ pairs of measurements $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
from variables $X$ and $Y$ that are $h$ units of distance apart, we define
**geospatial correlation** as the sample
[Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient):
$$
cor_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}
$$
where $\bar{x}$ and $\bar{y}$ are the mean values.
:::
Let's consider the following synthetic image to illustrate the concept
for different values of $h$:
```{julia}
using GeoIO
img = GeoIO.load("data/gaussian.gslib")
img |> viewer
```
The `hscatter` plot can be used to visualize the scatter of pairs
$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ at a given lag $h$.
We can choose a variable $X$ for the horizontal axis, a (possibly different)
variable $Y$ for the vertical axis, and the value of the lag $h$. In
order to reduce the computational costs associated with the plot, we
will sample a subset of measurements from the image:
```{julia}
sample = img |> Sample(1000, replace=false)
```
If we plot the values of the variable `Z` in the horizontal axis and the values
of the same variable measured at lag $h=0$ on the vertical axis, we get points
along the identity line (i.e. no scatter):
```{julia}
hscatter(sample, :Z, :Z, lag=0.0)
```
By increasing the value of the lag, we observe that the correlation is no longer
equal to one, and that the linear fit through the points approaches the horizontal
axis (i.e., zero correlation):
```{julia}
hscatter(sample, :Z, :Z, lag=3.0)
```
```{julia}
hscatter(sample, :Z, :Z, lag=5.0)
```
```{julia}
hscatter(sample, :Z, :Z, lag=10.0)
```
```{julia}
hscatter(sample, :Z, :Z, lag=50.0)
```
The Pearson correlation coefficient studied as a function of the lag $h$
is known as the **correlogram function**. For example, consider the exponential
correlogram function given by $cor(h) = \exp(-h)$:
```{julia}
#| echo: false
xs = range(0, stop=10, length=100)
ys = exp.(-xs)
Mke.lines(xs, ys, axis = (xlabel="h", ylabel="cor(h)"), color="black")
```
- The correlogram $cor(h)$ is often a non-increasing function
- It coincides with the usual correlation in data science at $h=0$
- $cor(h) \to 0$ as $h \to \infty$ in most practical cases
The terms **auto-correlogram** ($X = Y$) and **cross-correlogram** ($X \ne Y$) are also
encountered in the literature to differentiate the various geospatial correlations in
the multivariate case. Similarly, the terms **auto-covariance** and **cross-covariance**
are encountered by replacing the correlation by the covariance (non-normalized correlation).
Even though the correlogram function is widely used in other scientific fields, we will review
an alternative statistic of association that is most useful in geospatial data science.
## Variography
::: {.callout-tip}
## Definition
The **variogram function** is a more general alternative to the correlogram and covariance
functions that does **not** rely on the mean values $\bar{x}$ and $\bar{y}$. It is given by
$$
\gamma_x(h) \approx \frac{1}{2|N(h)|}\sum_{N(h)}(x_i-x_j)^2
$$
where $N(h) = \Big\{(i,j): i\underbrace{\longrightarrow}_{h \text{ units}} j\Big\}$ is the
set of pairs of locations that are $h$ units apart.
In the multivariate case, we can also define the **cross-variogram**:
$$
\gamma_{xy}(h) \approx \frac{1}{2|N(h)|}\sum_{N(h)}(x_i-x_j)(y_i-y_j)
$$
:::
The value $\gamma(h)$ measures the "spread" of the `hscatter` plot. Usually,
at $h=0$ there is no spread, and hence $\gamma(0) = 0$. In most practical
cases $\gamma(h) \to \sigma^2$ as $h \to \infty$ where $\sigma^2$ is the
maximum variance of the process. When this maximum variance exists, we
can write the following relation:
$$
\gamma(h) = \sigma^2 - cov(h)
$$
```{julia}
#| echo: false
xs = range(0, stop=10, length=100)
ys = 1 .- exp.(-xs)
Mke.lines(xs, ys, axis = (xlabel="h", ylabel="γ(h)"), color="black")
```
where $cov(h)$ is the covariance function, a version of the correlogram
function that is scaled by the standard deviations of X and Y:
$$
cor(h) = \frac{cov(h)}{\sigma_x \sigma_y}
$$
Explaining why the variogram is more general than the covariance is out
of scope for this book, but it has to do with the fact that variograms
operate on the "difference process" $(x_i - x_j)$ as opposed to the centered
process $(x_i - \bar{x})$. In particular, it does not require a finite maximum
variance $\sigma^2$.
::: {.callout-note}
The theory of intrinsic random functions of order k (IRF-k) is an advanced concept
from geostatistical theory that explains the generality of the variogram function
[@Chiles2012].
:::
Our main goal here is to gain intuition about the variogram function for interpolation
purposes. It suffices to learn its four basic elements: **range**, **sill**, **nugget** and **model**.
```{julia}
#| echo: false
r = 10.0
s = 1.0
n = 0.1
m = GaussianVariogram
g = m(range=r, sill=s, nugget=n)
fig = Mke.Figure()
Mke.Axis(fig[1,1], xlabel="h", ylabel="γ(h)", limits = (nothing, nothing, 0, 1.1))
Mke.plot!(g, maxlag=25, label="model")
Mke.vlines!(r, color="black", linestyle=:dash, label="range")
Mke.hlines!(s, color="teal", linestyle=:dash, label="sill")
Mke.hlines!(n, color="teal", linestyle=:dot, label="nugget")
Mke.axislegend("Elements", position = :rc)
fig
```
### range
The **range** (a.k.a. correlation length) of the variogram determines the average
size of "blobs" in the image. Let's consider two synthetic images with ranges
10 and 30, respectively:
```{julia}
#| echo: false
g = CartesianGrid(100, 25)
p1 = GaussianProcess(GaussianVariogram(range=10.0))
p2 = GaussianProcess(GaussianVariogram(range=30.0))
Random.seed!(2023)
d1 = rand(p1, g, [:Z => Float64], LUMethod())
d2 = rand(p2, g, [:Z => Float64], LUMethod())
fig = Mke.Figure()
viz(fig[1,1], d1.geometry, color = d1.Z, axis = (; title="range: 10"))
viz(fig[2,1], d2.geometry, color = d2.Z, axis = (; title="range: 30"))
fig
```
In the first image, we can clearly visualize the average size of yellow and blue blobs
around 10 pixels (i.e. quadrangles). In the second image, the blobs have an average size
of 30 pixels, which is greater than one of the sides of the grid (100x25 pixels).
### sill
To understand the **sill** of the variogram, let's consider a 1D grid as our domain, and
let's represent the values of the variable with height instead of color:
```{julia}
#| echo: false
g = CartesianGrid(100)
p1 = GaussianProcess(GaussianVariogram(range=10.0, sill=1.0))
p2 = GaussianProcess(GaussianVariogram(range=10.0, sill=9.0))
Random.seed!(2023)
d1 = rand(p1, g, [:Z => Float64], LUMethod())
d2 = rand(p2, g, [:Z => Float64], LUMethod())
f(x) = ustrip(first(to(centroid(x))))
fig = Mke.Figure()
Mke.lines(fig[1,1], f.(d1.geometry), d1.Z, color = "black", axis = (; title="sill: 1"))
Mke.lines(fig[2,1], f.(d2.geometry), d2.Z, color = "black", axis = (; title="sill: 9"))
fig
```
The **sill** determines the maximum variance of the process. If the sill is $\sigma^2$,
then a process with mean $\mu$ will oscillate within $\mu\pm3\sigma$ with 99.7% probability.
The vertical amplitude in the second plot is (3x) larger than that of the first plot.
In both plots, we have $\mu=0$.
### nugget
The **nugget** can be used to insert additional variance at scales that are smaller than
the scale of measurements (i.e. pixel). It is known in the image processing literature as
[salt-and-pepper noise](https://en.wikipedia.org/wiki/Salt-and-pepper_noise):
```{julia}
#| echo: false
g = CartesianGrid(100)
p1 = GaussianProcess(GaussianVariogram(range=10.0, nugget=0.1))
p2 = GaussianProcess(GaussianVariogram(range=10.0, nugget=0.2))
Random.seed!(2023)
d1 = rand(p1, g, [:Z => Float64], LUMethod())
d2 = rand(p2, g, [:Z => Float64], LUMethod())
f(x) = ustrip(first(to(centroid(x))))
fig = Mke.Figure()
Mke.lines(fig[1,1], f.(d1.geometry), d1.Z, color = "black", axis = (; title="nugget: 0.1"))
Mke.lines(fig[2,1], f.(d2.geometry), d2.Z, color = "black", axis = (; title="nugget: 0.2"))
fig
```
We can visualize the nugget effect in our 2D grid with colors as before:
```{julia}
#| echo: false
g = CartesianGrid(100, 25)
p1 = GaussianProcess(GaussianVariogram(range=10.0, nugget=0.0))
p2 = GaussianProcess(GaussianVariogram(range=10.0, nugget=0.1))
Random.seed!(2023)
d1 = rand(p1, g, [:Z => Float64], LUMethod())
Random.seed!(2023)
d2 = rand(p2, g, [:Z => Float64], LUMethod())
fig = Mke.Figure()
viz(fig[1,1], d1.geometry, color = d1.Z, axis = (; title="nugget: 0.0"))
viz(fig[2,1], d2.geometry, color = d2.Z, axis = (; title="nugget: 0.1"))
fig
```
::: {.callout-note}
The name "nugget" comes from gold nuggets in mining geostatistics.
These are often much smaller than selective mining units (SMUs), and
show as bright values in the 3D model of the mineral deposit.
:::
### model
Finally, the **model** of the variogram determines how the function increases
near the origin. The GeoStats.jl framework provides dozens of such models of
geospatial correlation. The most widely used are the `GaussianVariogram`, the
`SphericalVariogram` and the `ExponentialVariogram`:
```{julia}
#| echo: false
γ1 = GaussianVariogram()
γ2 = SphericalVariogram()
γ3 = ExponentialVariogram()
fig = Mke.Figure()
Mke.Axis(fig[1,1])
Mke.plot!(γ1, maxlag=2, vcolor = "teal", label = "Gaussian")
Mke.plot!(γ2, maxlag=2, vcolor = "slategray3", label = "Spherical")
Mke.plot!(γ3, maxlag=2, vcolor = "brown", label = "Exponential")
Mke.axislegend("Model", position = :rb)
fig
```
The faster is the increase of the function near the origin, the more "erratic" is the process:
```{julia}
#| echo: false
g = CartesianGrid(100, 25)
p1 = GaussianProcess(GaussianVariogram(range=10.0))
p2 = GaussianProcess(SphericalVariogram(range=10.0))
p3 = GaussianProcess(ExponentialVariogram(range=10.0))
Random.seed!(2023)
d1 = rand(p1, g, [:Z => Float64], LUMethod())
d2 = rand(p2, g, [:Z => Float64], LUMethod())
d3 = rand(p3, g, [:Z => Float64], LUMethod())
fig = Mke.Figure()
viz(fig[1,1], d1.geometry, color = d1.Z, axis = (; title="model: Gaussian"))
viz(fig[2,1], d2.geometry, color = d2.Z, axis = (; title="model: Spherical"))
viz(fig[3,1], d3.geometry, color = d3.Z, axis = (; title="model: Exponential"))
fig
```
All the four elements of the variogram function can be easily set at construction time:
```{julia}
γ = GaussianVariogram(range=10.0, sill=2.0, nugget=0.1)
```
And queried later with the corresponding functions:
```{julia}
range(γ), sill(γ), nugget(γ)
```
We can evaluate the variogram at any given lag:
```{julia}
γ(1.0)
```
Or evaluate the variogram between any two points:
```{julia}
γ(Point(0, 0), Point(1, 0))
```
In this case, the `Euclidean` metric is used by default to compute the lag.
More generally, we can evaluate the variogram between any two geometries:
```{julia}
γ(Point(0, 0), Triangle((0, 0), (1, 0), (1, 1)))
```
::: {.callout-note}
The evaluation of the variogram function between two geometries is known as
variogram regularization, and implemented in terms of numerical integration.
:::
Remind that the variogram value $\gamma(h)$ is a measure of spread in the
`hscatter` plot. It tells how much variation is expected for a variable
at a distance $h$ from a reference point.
## Fitting models
Given geospatial data, how do we fit an appropriate variogram model for it?
This practical question is traditionally answered in two steps as follows.
### Empirical estimate
Let's recap the synthetic image from the beginning of the chapter:
```{julia}
img |> viewer
```
We can use the `EmpiricalVariogram` to estimate the function at specific
lag values:
```{julia}
g = EmpiricalVariogram(img, :Z, maxlag = 50.0)
```
```{julia}
Mke.plot(g)
```
::: {.callout-note}
The numbers and bars in the empirical variogram plot represent the number
of pairs used to estimate the value of the variogram at the corresponding bin.
The larger the number, the more confident we can be in the estimate.
:::
The `DirectionalVariogram` can be used to estimate the function along
specific directions:
```{julia}
gₕ = DirectionalVariogram((1.0, 0.0), img, :Z, maxlag = 50.0)
gᵥ = DirectionalVariogram((0.0, 1.0), img, :Z, maxlag = 50.0)
Mke.plot(gₕ, hshow = false, vcolor = "maroon")
Mke.plot!(gᵥ, hshow = false, vcolor = "slategray")
Mke.current_figure()
```
In this example, we observe that the blobs are elongated with a horizontal
range of 30 pixels and a vertical range of 10 pixels. This is known as
geometric **anisotropy**.
We can also estimate the variogram in all directions on a plane with the
`EmpiricalVarioplane`:
```{julia}
gₚ = EmpiricalVarioplane(img, :Z, maxlag = 50.0)
```
The varioplane is usually plotted on a polar axis to highlight the different
ranges as a function of the polar angle. These ranges are illustrated with a
solid curve over the heatmap:
```{julia}
fig = Mke.Figure()
ax = Mke.PolarAxis(fig[1,1])
Mke.plot!(ax, gₚ)
fig
```
::: {.callout-note}
The book by @Webster2007 and the article by @Cressie1980 are good resources
to learn more about robust variogram estimation.
:::
### Least-squares fit
After empirical variogram estimation, the next step consists of fitting a
theoretical model. This step is necessary for interpolation given that we need
to be able to evaluate the variogram function at any lag $h$, not just the
specific lags of the empirical variogram.
::: {.callout-note}
Another reason to fit theoretical models is to ensure that variances of linear
combinations of variables are always non-negative as discussed in @Myers1992.
:::
To fit a specific theoretical model, we can use the `fit` function with the
model as the first argument:
```{julia}
GeoStatsFunctions.fit(SphericalVariogram, g)
```
We can also let the framework select the model with minimum weighted least-squares
error by passing the generic `Variogram` model to the function:
```{julia}
γ = GeoStatsFunctions.fit(Variogram, g)
```
## Remarks
This chapter is definitely one of the most challenging ones for those with little
background in geostatistics. Let's make a few important remarks to summarize what
we learned:
- Geospatial correlation can be represented with different functions, including
the **correlogram**, the **covariance** and the **variogram** functions. Among
these functions the variogram is the most general and easy to interpret as a
measure of "spread" in the `hscatter` plot.
- The variogram value $\gamma(h)$ represents the expected variation of a variable
that is $h$ units of distance from a reference point. It usually starts at
$\gamma(0) = 0$, reaches a **sill** value $\sigma^2$ near the **range** and
stays at this value as $h \to \infty$.
- The selection of an appropriate theoretical variogram model for interpolation of
geospatial data is often based on a two-step procedure. First, we estimate the
`EmpiricalVariogram`, and then we `fit` a theoretical model. The most widely
used models are the `GaussianVariogram`, the `SphericalVariogram` and the
`ExponentialVariogram`.
In the next chapter, we will learn how to perform **geospatial interpolation** with
the selected theoretical variogram model.