10-correlation.qmd

# Geospatial correlation

```{julia}
#| echo: false
#| output: false
import Pkg
Pkg.activate(".")
using Random
using GeoStats
import CairoMakie as Mke
```

In **Part II** and **Part III** of the book, we learned two important
tools for *efficient* geospatial data science. We learned how **transform
pipelines** can be used to prepare geospatial data for investigation, and
how **geospatial queries** can be used to answer geoscientific questions.
Before we can learn our third tool, we need to review the important concept
of **geospatial correlation**:

::: {.callout-tip}

## Definition

Given $n$ pairs of measurements $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
from variables $X$ and $Y$ that are $h$ units of distance apart, we define
**geospatial correlation** as the sample
[Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient):

$$
cor_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}
$$

where $\bar{x}$ and $\bar{y}$ are the mean values.

:::

Let's consider the following synthetic image to illustrate the concept
for different values of $h$:

```{julia}
using GeoIO

img = GeoIO.load("data/gaussian.gslib")

img |> viewer
```

The `hscatter` plot can be used to visualize the scatter of pairs
$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ at a given lag $h$.
We can choose a variable $X$ for the horizontal axis, a (possibly different)
variable $Y$ for the vertical axis, and the value of the lag $h$. In
order to reduce the computational costs associated with the plot, we
will sample a subset of measurements from the image:

```{julia}
sample = img |> Sample(1000, replace=false)
```

If we plot the values of the variable `Z` in the horizontal axis and the values
of the same variable measured at lag $h=0$ on the vertical axis, we get points
along the identity line (i.e. no scatter):

```{julia}
hscatter(sample, :Z, :Z, lag=0.0)
```

By increasing the value of the lag, we observe that the correlation is no longer
equal to one, and that the linear fit through the points approaches the horizontal
axis (i.e., zero correlation):

```{julia}
hscatter(sample, :Z, :Z, lag=3.0)
```

```{julia}
hscatter(sample, :Z, :Z, lag=5.0)
```

```{julia}
hscatter(sample, :Z, :Z, lag=10.0)
```

```{julia}
hscatter(sample, :Z, :Z, lag=50.0)
```

The Pearson correlation coefficient studied as a function of the lag $h$
is known as the **correlogram function**. For example, consider the exponential
correlogram function given by $cor(h) = \exp(-h)$:

```{julia}
#| echo: false
xs = range(0, stop=10, length=100)
ys = exp.(-xs)
Mke.lines(xs, ys, axis = (xlabel="h", ylabel="cor(h)"), color="black")
```

- The correlogram $cor(h)$ is often a non-increasing function
- It coincides with the usual correlation in data science at $h=0$
- $cor(h) \to 0$ as $h \to \infty$ in most practical cases

The terms **auto-correlogram** ($X = Y$) and **cross-correlogram** ($X \ne Y$) are also
encountered in the literature to differentiate the various geospatial correlations in
the multivariate case. Similarly, the terms **auto-covariance** and **cross-covariance**
are encountered by replacing the correlation by the covariance (non-normalized correlation).

Even though the correlogram function is widely used in other scientific fields, we will review
an alternative statistic of association that is most useful in geospatial data science.

## Variography

::: {.callout-tip}

## Definition

The **variogram function** is a more general alternative to the correlogram and covariance
functions that does **not** rely on the mean values $\bar{x}$ and $\bar{y}$. It is given by

$$
\gamma_x(h) \approx \frac{1}{2|N(h)|}\sum_{N(h)}(x_i-x_j)^2
$$

where $N(h) = \Big\{(i,j): i\underbrace{\longrightarrow}_{h \text{ units}} j\Big\}$ is the
set of pairs of locations that are $h$ units apart.

In the multivariate case, we can also define the **cross-variogram**:

$$
\gamma_{xy}(h) \approx \frac{1}{2|N(h)|}\sum_{N(h)}(x_i-x_j)(y_i-y_j)
$$

:::

The value $\gamma(h)$ measures the "spread" of the `hscatter` plot. Usually,
at $h=0$ there is no spread, and hence $\gamma(0) = 0$. In most practical
cases $\gamma(h) \to \sigma^2$ as $h \to \infty$ where $\sigma^2$ is the
maximum variance of the process. When this maximum variance exists, we
can write the following relation:

$$
\gamma(h) = \sigma^2 - cov(h) 
$$

```{julia}
#| echo: false
xs = range(0, stop=10, length=100)
ys = 1 .- exp.(-xs)
Mke.lines(xs, ys, axis = (xlabel="h", ylabel="γ(h)"), color="black")
```

where $cov(h)$ is the covariance function, a version of the correlogram 
function that is scaled by the standard deviations of X and Y:

$$
cor(h) = \frac{cov(h)}{\sigma_x \sigma_y}
$$

Explaining why the variogram is more general than the covariance is out
of scope for this book, but it has to do with the fact that variograms
operate on the "difference process" $(x_i - x_j)$ as opposed to the centered
process $(x_i - \bar{x})$. In particular, it does not require a finite maximum
variance $\sigma^2$.

::: {.callout-note}

The theory of intrinsic random functions of order k (IRF-k) is an advanced concept
from geostatistical theory that explains the generality of the variogram function
[@Chiles2012].

:::

Our main goal here is to gain intuition about the variogram function for interpolation
purposes. It suffices to learn its four basic elements: **range**, **sill**, **nugget** and **model**.

```{julia}
#| echo: false
r = 10.0
s = 1.0
n = 0.1
m = GaussianVariogram
g = m(range=r, sill=s, nugget=n)

fig = Mke.Figure()
Mke.Axis(fig[1,1], xlabel="h", ylabel="γ(h)", limits = (nothing, nothing, 0, 1.1))
Mke.plot!(g, maxlag=25, label="model")
Mke.vlines!(r, color="black", linestyle=:dash, label="range")
Mke.hlines!(s, color="teal", linestyle=:dash, label="sill")
Mke.hlines!(n, color="teal", linestyle=:dot, label="nugget")
Mke.axislegend("Elements", position = :rc)
fig
```

### range

The **range** (a.k.a. correlation length) of the variogram determines the average
size of "blobs" in the image. Let's consider two synthetic images with ranges
10 and 30, respectively:

```{julia}
#| echo: false
g = CartesianGrid(100, 25)

p1 = GaussianProcess(GaussianVariogram(range=10.0))
p2 = GaussianProcess(GaussianVariogram(range=30.0))

Random.seed!(2023)
d1 = rand(p1, g, [:Z => Float64], LUMethod())
d2 = rand(p2, g, [:Z => Float64], LUMethod())

fig = Mke.Figure()
viz(fig[1,1], d1.geometry, color = d1.Z, axis = (; title="range: 10"))
viz(fig[2,1], d2.geometry, color = d2.Z, axis = (; title="range: 30"))
fig
```

In the first image, we can clearly visualize the average size of yellow and blue blobs
around 10 pixels (i.e. quadrangles). In the second image, the blobs have an average size
of 30 pixels, which is greater than one of the sides of the grid (100x25 pixels).

### sill

To understand the **sill** of the variogram, let's consider a 1D grid as our domain, and
let's represent the values of the variable with height instead of color:

```{julia}
#| echo: false
g = CartesianGrid(100)

p1 = GaussianProcess(GaussianVariogram(range=10.0, sill=1.0))
p2 = GaussianProcess(GaussianVariogram(range=10.0, sill=9.0))

Random.seed!(2023)
d1 = rand(p1, g, [:Z => Float64], LUMethod())
d2 = rand(p2, g, [:Z => Float64], LUMethod())

f(x) = ustrip(first(to(centroid(x))))

fig = Mke.Figure()
Mke.lines(fig[1,1], f.(d1.geometry), d1.Z, color = "black", axis = (; title="sill: 1"))
Mke.lines(fig[2,1], f.(d2.geometry), d2.Z, color = "black", axis = (; title="sill: 9"))
fig
```

The **sill** determines the maximum variance of the process. If the sill is $\sigma^2$,
then a process with mean $\mu$ will oscillate within $\mu\pm3\sigma$ with 99.7% probability.
The vertical amplitude in the second plot is (3x) larger than that of the first plot.
In both plots, we have $\mu=0$. 

### nugget

The **nugget** can be used to insert additional variance at scales that are smaller than
the scale of measurements (i.e. pixel). It is known in the image processing literature as
[salt-and-pepper noise](https://en.wikipedia.org/wiki/Salt-and-pepper_noise):

```{julia}
#| echo: false
g = CartesianGrid(100)

p1 = GaussianProcess(GaussianVariogram(range=10.0, nugget=0.1))
p2 = GaussianProcess(GaussianVariogram(range=10.0, nugget=0.2))

Random.seed!(2023)
d1 = rand(p1, g, [:Z => Float64], LUMethod())
d2 = rand(p2, g, [:Z => Float64], LUMethod())

f(x) = ustrip(first(to(centroid(x))))

fig = Mke.Figure()
Mke.lines(fig[1,1], f.(d1.geometry), d1.Z, color = "black", axis = (; title="nugget: 0.1"))
Mke.lines(fig[2,1], f.(d2.geometry), d2.Z, color = "black", axis = (; title="nugget: 0.2"))
fig
```

We can visualize the nugget effect in our 2D grid with colors as before:

```{julia}
#| echo: false
g = CartesianGrid(100, 25)

p1 = GaussianProcess(GaussianVariogram(range=10.0, nugget=0.0))
p2 = GaussianProcess(GaussianVariogram(range=10.0, nugget=0.1))

Random.seed!(2023)
d1 = rand(p1, g, [:Z => Float64], LUMethod())
Random.seed!(2023)
d2 = rand(p2, g, [:Z => Float64], LUMethod())

fig = Mke.Figure()
viz(fig[1,1], d1.geometry, color = d1.Z, axis = (; title="nugget: 0.0"))
viz(fig[2,1], d2.geometry, color = d2.Z, axis = (; title="nugget: 0.1"))
fig
```

::: {.callout-note}

The name "nugget" comes from gold nuggets in mining geostatistics.
These are often much smaller than selective mining units (SMUs), and
show as bright values in the 3D model of the mineral deposit.

:::

### model

Finally, the **model** of the variogram determines how the function increases
near the origin. The GeoStats.jl framework provides dozens of such models of
geospatial correlation. The most widely used are the `GaussianVariogram`, the
`SphericalVariogram` and the `ExponentialVariogram`:

```{julia}
#| echo: false
γ1 = GaussianVariogram()
γ2 = SphericalVariogram()
γ3 = ExponentialVariogram()

fig = Mke.Figure()
Mke.Axis(fig[1,1])
Mke.plot!(γ1, maxlag=2, vcolor = "teal", label = "Gaussian")
Mke.plot!(γ2, maxlag=2, vcolor = "slategray3", label = "Spherical")
Mke.plot!(γ3, maxlag=2, vcolor = "brown", label = "Exponential")
Mke.axislegend("Model", position = :rb)
fig
```

The faster is the increase of the function near the origin, the more "erratic" is the process:

```{julia}
#| echo: false
g = CartesianGrid(100, 25)

p1 = GaussianProcess(GaussianVariogram(range=10.0))
p2 = GaussianProcess(SphericalVariogram(range=10.0))
p3 = GaussianProcess(ExponentialVariogram(range=10.0))

Random.seed!(2023)
d1 = rand(p1, g, [:Z => Float64], LUMethod())
d2 = rand(p2, g, [:Z => Float64], LUMethod())
d3 = rand(p3, g, [:Z => Float64], LUMethod())

fig = Mke.Figure()
viz(fig[1,1], d1.geometry, color = d1.Z, axis = (; title="model: Gaussian"))
viz(fig[2,1], d2.geometry, color = d2.Z, axis = (; title="model: Spherical"))
viz(fig[3,1], d3.geometry, color = d3.Z, axis = (; title="model: Exponential"))
fig
```

All the four elements of the variogram function can be easily set at construction time:

```{julia}
γ = GaussianVariogram(range=10.0, sill=2.0, nugget=0.1)
```

And queried later with the corresponding functions:

```{julia}
range(γ), sill(γ), nugget(γ)
```

We can evaluate the variogram at any given lag:

```{julia}
γ(1.0)
```

Or evaluate the variogram between any two points:

```{julia}
γ(Point(0, 0), Point(1, 0))
```

In this case, the `Euclidean` metric is used by default to compute the lag.
More generally, we can evaluate the variogram between any two geometries:

```{julia}
γ(Point(0, 0), Triangle((0, 0), (1, 0), (1, 1)))
```

::: {.callout-note}

The evaluation of the variogram function between two geometries is known as
variogram regularization, and implemented in terms of numerical integration.

:::

Remind that the variogram value $\gamma(h)$ is a measure of spread in the
`hscatter` plot. It tells how much variation is expected for a variable
at a distance $h$ from a reference point.

## Fitting models

Given geospatial data, how do we fit an appropriate variogram model for it?
This practical question is traditionally answered in two steps as follows.

### Empirical estimate

Let's recap the synthetic image from the beginning of the chapter:

```{julia}
img |> viewer
```

We can use the `EmpiricalVariogram` to estimate the function at specific
lag values:

```{julia}
g = EmpiricalVariogram(img, :Z, maxlag = 50.0)
```

```{julia}
Mke.plot(g)
```

::: {.callout-note}

The numbers and bars in the empirical variogram plot represent the number
of pairs used to estimate the value of the variogram at the corresponding bin.
The larger the number, the more confident we can be in the estimate.

:::

The `DirectionalVariogram` can be used to estimate the function along
specific directions:

```{julia}
gₕ = DirectionalVariogram((1.0, 0.0), img, :Z, maxlag = 50.0)
gᵥ = DirectionalVariogram((0.0, 1.0), img, :Z, maxlag = 50.0)

Mke.plot(gₕ, hshow = false, vcolor = "maroon")
Mke.plot!(gᵥ, hshow = false, vcolor = "slategray")
Mke.current_figure()
```

In this example, we observe that the blobs are elongated with a horizontal
range of 30 pixels and a vertical range of 10 pixels. This is known as
geometric **anisotropy**.

We can also estimate the variogram in all directions on a plane with the
`EmpiricalVarioplane`:

```{julia}
gₚ = EmpiricalVarioplane(img, :Z, maxlag = 50.0)
```

The varioplane is usually plotted on a polar axis to highlight the different
ranges as a function of the polar angle. These ranges are illustrated with a
solid curve over the heatmap:

```{julia}
fig = Mke.Figure()
ax = Mke.PolarAxis(fig[1,1])
Mke.plot!(ax, gₚ)
fig
```

::: {.callout-note}

The book by @Webster2007 and the article by @Cressie1980 are good resources
to learn more about robust variogram estimation.

:::

### Least-squares fit

After empirical variogram estimation, the next step consists of fitting a
theoretical model. This step is necessary for interpolation given that we need
to be able to evaluate the variogram function at any lag $h$, not just the
specific lags of the empirical variogram.

::: {.callout-note}

Another reason to fit theoretical models is to ensure that variances of linear
combinations of variables are always non-negative as discussed in @Myers1992.

:::

To fit a specific theoretical model, we can use the `fit` function with the
model as the first argument:

```{julia}
GeoStatsFunctions.fit(SphericalVariogram, g)
```

We can also let the framework select the model with minimum weighted least-squares
error by passing the generic `Variogram` model to the function:

```{julia}
γ = GeoStatsFunctions.fit(Variogram, g)
```

## Remarks

This chapter is definitely one of the most challenging ones for those with little
background in geostatistics. Let's make a few important remarks to summarize what
we learned:

- Geospatial correlation can be represented with different functions, including
  the **correlogram**, the **covariance** and the **variogram** functions. Among
  these functions the variogram is the most general and easy to interpret as a
  measure of "spread" in the `hscatter` plot.
- The variogram value $\gamma(h)$ represents the expected variation of a variable
  that is $h$ units of distance from a reference point. It usually starts at
  $\gamma(0) = 0$, reaches a **sill** value $\sigma^2$ near the **range** and
  stays at this value as $h \to \infty$.
- The selection of an appropriate theoretical variogram model for interpolation of
  geospatial data is often based on a two-step procedure. First, we estimate the
  `EmpiricalVariogram`, and then we `fit` a theoretical model. The most widely
  used models are the `GaussianVariogram`, the `SphericalVariogram` and the
  `ExponentialVariogram`.

In the next chapter, we will learn how to perform **geospatial interpolation** with
the selected theoretical variogram model.