-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathREADME.Rmd
232 lines (174 loc) · 9 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
---
title: "README"
output:
github_document:
toc: true
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r opts_chunk, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
# fig.path = "README-"
fig.path = "tools/README-"
)
set.seed(0)
```
# GauPro
<!-- badges: start -->
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/GauPro)](https://cran.r-project.org/package=GauPro)
[![codecov](https://codecov.io/gh/CollinErickson/GauPro/branch/master/graph/badge.svg)](https://app.codecov.io/gh/CollinErickson/GauPro)
[![R-CMD-check](https://github.com/CollinErickson/GauPro/workflows/R-CMD-check/badge.svg)](https://github.com/CollinErickson/GauPro/actions)
[![R-CMD-check](https://github.com/CollinErickson/GauPro/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/CollinErickson/GauPro/actions/workflows/R-CMD-check.yaml)
[![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/last-month/GauPro?color=blue)](https://r-pkg.org/pkg/GauPro)
[![CRAN checks](https://badges.cranchecks.info/summary/GauPro.svg)](https://cran.r-project.org/web/checks/check_results_GauPro.html)
<!-- badges: end -->
## Overview
This package allows you to fit a Gaussian process regression model to a dataset.
A Gaussian process (GP) is a commonly used model in computer simulation.
It assumes that the distribution of any set of points is multivariate
normal.
A major benefit of GP models is that they provide uncertainty estimates along
with their predictions.
## Installation
You can install like any other package through CRAN.
```
install.packages('GauPro')
```
The most up-to-date version can be downloaded from
my Github account.
```
# install.packages("devtools")
devtools::install_github("CollinErickson/GauPro")
```
## Example in 1-Dimension
This simple shows how to fit the Gaussian process regression model to data.
The function `gpkm` creates a Gaussian process kernel model fit to the
given data.
```{r, libraryGauPro}
library(GauPro)
```
```{r, fitsine}
n <- 12
x <- seq(0, 1, length.out = n)
y <- sin(6*x^.8) + rnorm(n,0,1e-1)
gp <- gpkm(x, y)
```
Plotting the model helps us understand how accurate the model is and how much
uncertainty it has in its predictions.
The green and red lines are the 95% intervals for the mean and for samples,
respectively.
```{r, plotsine}
gp$plot1D()
```
## Factor data: fitting the `diamonds` dataset
The model fit using `gpkm` can also be used with data/formula input and
can properly handle factor data.
In this example, the `diamonds` data set is fit by specifying the formula
and passing a data frame with the appropriate columns.
```{r fit_dm}
library(ggplot2)
diamonds_subset <- diamonds[sample(1:nrow(diamonds), 60), ]
dm <- gpkm(price ~ carat + cut + color + clarity + depth,
diamonds_subset)
```
Calling `summary` on the model gives details about the model, including
diagnostics about the model fit and the relative importance of the features.
```{r summary_dm}
summary(dm)
```
We can also plot the model to get a visual idea of how each input affects the
output.
```{r plot_dm}
plot(dm)
```
### Constructing a kernel
In this case, the kernel was chosen automatically by looking at which dimensions
were continuous and which were discrete, and then using a Matern 5/2 on the
continuous dimensions (1,5), and separate ordered factor kernels on the other
dimensions since those columns in the data frame are all ordinal factors.
We can construct our own kernel using products and sums of any kernels,
making sure that the continuous kernels ignore the factor dimensions.
Suppose we want to construct a kernel for this example that uses the power
exponential kernel for the two continuous dimensions, ordered kernels on
`cut` and `color`, and a Gower kernel on `clarity`.
First we construct the power exponential kernel that ignores the 3 factor
dimensions.
Then we construct
```{r diamond_construct_kernel}
cts_kernel <- k_IgnoreIndsKernel(k=k_PowerExp(D=2), ignoreinds = c(2,3,4))
factor_kernel2 <- k_OrderedFactorKernel(D=5, xindex=2, nlevels=nlevels(diamonds_subset[[2]]))
factor_kernel3 <- k_OrderedFactorKernel(D=5, xindex=3, nlevels=nlevels(diamonds_subset[[3]]))
factor_kernel4 <- k_GowerFactorKernel(D=5, xindex=4, nlevels=nlevels(diamonds_subset[[4]]))
# Multiply them
diamond_kernel <- cts_kernel * factor_kernel2 * factor_kernel3 * factor_kernel4
```
Now we can pass this kernel into `gpkm` and it will use it.
```{r diamond_construct_kernel_fit}
dm <- gpkm(price ~ carat + cut + color + clarity + depth,
diamonds_subset, kernel=diamond_kernel)
dm$plotkernel()
```
## Using kernels
A key modeling decision for Gaussian process models is the choice of kernel.
The kernel determines the covariance and the behavior of the model.
The default kernel is the Matern 5/2 kernel (`Matern52`), and is a good
choice for most cases.
The Gaussian, or squared exponential, kernel (`Gaussian`) is a common choice
but often leads to a bad fit since it assumes the process the data comes
from is infinitely differentiable.
Other common choices that are available include the `Exponential`,
Matern 3/2 (`Matern32`),
Power Exponential (`PowerExp`),
`Cubic`,
Rational Quadratic (`RatQuad`),
and Triangle (`Triangle`).
These kernels only work on numeric data.
For factor data, the kernel will default to a
Latent Factor Kernel (`LatentFactorKernel`) for character and unordered factors,
or an Ordered Factor Kernel (`OrderedFactorKernel`) for ordered factors.
As long as the input is given in as a data frame and the columns have the
proper types, then the default kernel will properly handle it by applying
the numeric kernel to the numeric inputs and
the factor kernel to the factor and character inputs.
Kernels are stored as R6 objects.
They can all be created using the R6 object generator (e.g., `Matern52$new()`),
or using the `k_<kernel name>` shortcut function (e.g., `k_Matern52()`).
The latter is easier to use (and recommended) since R will show the function
arguments and autocomplete.
The following table shows details on all the kernels available.
| Kernel | Function | Continuous/<br />discrete | Equation | Notes |
| --- | --- | --- | --- | --- |
| Gaussian | `k_Gaussian` | cts | | Often causes issues since it assumes infinite differentiability. Experts don't recommend using it. |
| Matern 3/2 | `k_Matern32` | cts | | Assumes one time differentiability. This is often too low of an assumption. |
| Matern 5/2 | `k_Matern52` | cts | | Assumes two time differentiability. Generally the best. |
| Exponential | `k_Exponential` | cts | | Equivalent to Matern 1/2. Assumes no differentiability. |
| Triangle | `k_Triangle` | cts | | |
| Power exponential | `k_PowerExp` | cts | | |
| Periodic | `k_Periodic` | cts | $k(x, y) = \sigma^2 * \exp(-\sum(\alpha_i*sin(p * (x_i-y_i))^2))$ | The only kernel that takes advantage of periodic data. But there is often incoherence far apart, so you will likely want to multiply by one of the standard kernels. |
| Cubic | `k_Cubic` | cts | | |
| Rational quadratic | `k_RatQuad` | cts | | |
| Latent factor kernel | `k_LatentFactorKernel` | factor | | This embeds each discrete value into a low dimensional space and calculates the distances in that space. This works well when there are many discrete values. |
| Ordered factor kernel | `k_OrderedFactorKernel` | factor | | This maintains the order of the discrete values. E.g., if there are 3 levels, it will ensure that 1 and 2 have a higher correlation than 1 and 3. This is similar to embedding into a latent space with 1 dimension and requiring the values to be kept in numerical order. |
| Factor kernel | `k_FactorKernel` | factor | | This fits a parameter for every pair of possible values. E.g., if there are 4 discrete values, it will fit 6 (4 choose 2) values. This doesn't scale well. When there are many discrete values, use any of the other factor kernels. |
| Gower factor kernel | `k_GowerFactorKernel` | factor | $k(x,y) = \begin{cases} 1, & \text{if } x=y \\ p, & \text{if } x \neq y \end{cases}$ | This is a very simple factor kernel. For the relevant dimension, the correlation will either be 1 if the value are the same, or $p$ if they are different. |
| Ignore indices | `k_IgnoreIndsKernel` | N/A | | Use this to create a kernel that ignores certain dimensions. Useful when you want to fit different kernel types to different dimensions or when there is a mix of continuous and discrete dimensions. |
Factor kernels: note that these all only work on a single dimension. If there
are multiple factor dimensions in your input, then they each will be given a
different factor kernel.
## Combining kernels
Kernels can be combined by multiplying or adding them directly.
The following example uses the product of a periodic and a Matern 5/2 kernel
to fit periodic data.
```{r combine seed, include=F}
set.seed(99)
```
```{r combine_periodic}
x <- 1:20
y <- sin(x) + .1*x^1.3
combo_kernel <- k_Periodic(D=1) * k_Matern52(D=1)
gp <- gpkm(x, y, kernel=combo_kernel, nug.min=1e-6)
gp$plot()
```
For an example of a more complex kernel being constructed,
see the diamonds section above.