-
Notifications
You must be signed in to change notification settings - Fork 2
/
README.Rmd
285 lines (216 loc) · 11.1 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
---
output: github_document
---
<!-- README.md is generated from README.rmd. Please edit that file -->
```{r, echo=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/"
)
library(plt)
```
# plt
Persistence landscapes are a vectorization of persistence data/diagrams that have useful statistical properties including linearity and an inner product.[^pl]
This is an R package interface to a C++ library to efficienctly compute and calculate with persistence landscapes.[^plt]
[^pl]: Bubenik P (2015) "Statistical Topological Data Analysis using Persistence Landscapes". _Journal of Machine Learning Research_ **16**(3):77--102. <https://jmlr.csail.mit.edu/papers/v16/bubenik15a.html>
[^plt]: Bubenik P & Dłotko P (2017) "A persistence landscapes toolbox for topological statistics". _Journal of Symbolic Computation_ **78**(1):91--114. <https://www.sciencedirect.com/science/article/pii/S0747717116300104>
## Installation
Until the package is on CRAN, use **pak** to install the package from the GitHub repository as follows:
```{r, eval=FALSE}
install.packages("pak")
pak::pkg_install("corybrunson/plt")
```
Alternatively---and especially if you want to contribute---you can clone or download the code repository and, from within the directory, install the package from source:
```{r, eval=FALSE}
devtools::install()
```
You should now be able to load the package normally from an R session:
```{r, eval=FALSE}
library(plt)
```
## Quickstart Guide
The **plt** package supports various operations involving persistence landscapes:
* Compute persistence landscapes from persistence data
* Plot persistence landscapes
* Perform Hilbert space operations (scaling, addition, inner product) plus some additional queries and transformations (extremal values, absolute value, integration) on persistence landscapes
* Conduct hypothesis tests on samples of persistence landscapes
* Vectorize persistence landscapes for other purposes including machine learning
Examples and tests in **plt** rely on other packages to simulate data and to compute persistence diagrams from data:
* **tdaunif** provides functions to sample uniformly from various immersed manifolds.
* **ripserr** and **TDA** provide functions to compute persistence data from point clouds and distance matrices.
**plt** introduces the 'Rcpp_PersistenceLandscape' S4 class, which is exposed using **Rcpp** from the underlying 'PersistenceLandscape' C++ class.
Instances of this class can be created using `new()` but the recommended way is to use `pl_new()`.
This function accepts either a single matrix of persistence data or a specially formatted list with the class `'persistence_diagram"`.
The `$pairs` entry of the list is itself a list, of a 2-column matrix of persistence pairs for each homological degree from 0 (`$pairs[[1]]`) to the maximum degree calculated.
The generic converter `as_persistence()` includes methods for outputs from `ripserr::vietoris_rips()` and from `TDA::*Diag()`; it operates under the hood of `pl_new()`, but we invoke it explicitly here for illustration.
### Calculation
To begin an illustration, we noisily sample 60 points from a figure eight and compute the persistence diagram of the point cloud:
```{r}
set.seed(513611L)
pc <- tdaunif::sample_lemniscate_gerono(60, sd = .1)
plot(pc, asp = 1, pch = 16L)
pd <- ripserr::vietoris_rips(pc, dim = 1, threshold = 2, p = 2)
print(pd)
```
We the convert the persistence data to the preferred persistence diagram format and inspect some of its features:
```{r}
pd <- as_persistence(pd)
print(pd)
print(head(pd$pairs[[1]]))
print(head(pd$pairs[[2]]))
```
This allows us to compute a persistence landscape---in this case, for the 1-dimensional features.
Here we compute the landscape exactly, which can be cost-prohibitive for larger persistence data, and print its summary:
```{r}
pl1 <- pl_new(pd, degree = 1, exact = TRUE)
print(pl1)
summary(pl1)
```
Some advanced concepts like the magnitude of a landscape will be explained below.
### Class
The object `pl1` is not an array, but rather an object that encapsulates both the data that encode a landscape and several basic operations that can be performed on it.
This allows us to work with persistence landscapes without worrying about pre-processing their representations.
At any point, the underlying encoding of the landscape can be extracted using `$getInternal()`, which in the case of an exactly calculated landscape returns a list of 2-column matrices, each matrix containing coordinates that define one level of the landscape as a piecewise linear function:
```{r}
print(length(pl1$getInternal()))
print(pl1$getInternal())
```
An alternative, approximate construction computes the value of each level of the landscape at each point on a 1-dimensional grid, ranging from `xmin` to `xmax` at increments of `xby`.
A landscape constructed using a discrete approximation is stored as a 3-dimensional array of dimensions (levels, values, 2), with one level per feature (some of which may be trivial) and one value per grid point, stored as $x,y$ pairs along the third dimension.
```{r}
b_ran <- pl_support(pl1)
pl1d <- pl_new(pd, degree = 1,
xmin = b_ran[[1L]], xmax = b_ran[[2L]], xby = 0.05)
print(dim(pl1d$getInternal()))
print(pl1d$getInternal())
```
Exactly computed landscapes can be converted to discrete landscape objects, but the other direction is not well-defined.
Below, we view a portion of the discretized exact landscape, using the default bounds and resolution given to `pl1`:
```{r}
# default conversion to discrete uses `xby = 0.001`
pl1_ <- pl1$discretize()
print(dim(pl1_$getInternal()))
# print first 12 x-coordinates
pl1_$getInternal()[, seq(230L, 270L), , drop = FALSE]
```
We can also specify the bounds and the resolution of the discretization:
```{r}
pl1 <- pl_delimit(pl1, xmin = 0, xmax = 1, xby = 0.1)
pl1_ <- pl_discretize(pl1)
pl1_$getInternal()
```
### Visualization
**plt** provides a `plot()` method for the 'Rcpp_PersistenceLandscape' class.
It uses **grDevices** to build color palettes, and as such its default palette is viridis; but the user may supply the name of a recognized palette or a sequence of colors between which to interpolate:
```{r}
n_levs <- max(pl_num_levels(pl1), pl_num_levels(pl1d))
par(mfrow = c(2L, 1L), mar = c(2, 2, 0, 2))
plot(pl1, palette = "terrain", n_levels = n_levs, asp = 1)
plot(pl1d, palette = "terrain", n_levels = n_levs, asp = 1)
par(mfrow = c(1L, 1L), mar = c(5.1, 4.1, 4.1, 2.1))
```
### Hilbert Space Operations
To illustrate these features, we first generate a companion data set:
```{r}
# a new landscape and its discretization
set.seed(772888L)
pc2 <- tdaunif::sample_circle(60, sd = .1) / 2
pd2 <- ripserr::vietoris_rips(pc2, dim = 1, threshold = 2, p = 2)
pl2 <- pl_new(pd2, degree = 1, exact = TRUE)
pl2 <- pl_delimit(pl2, xmin = 0, xmax = 2, xby = 0.1)
pl2_ <- pl_discretize(pl2)
```
Several infix operators have been taught to work in the natural way with landscapes, so that users can explore vector space operations and inner products more conveniently:
```{r}
par(mfcol = c(3L, 2L), mar = c(2, 2, 0, 2))
# vector space operations on exact landscapes
plot(pl1 * 2)
plot(pl2)
plot(pl1 * 2 + pl2)
# vector space operations on discrete landscapes
plot(pl1_)
plot(-pl2_)
plot(2 * pl1_ - pl2_)
par(mfrow = c(1L, 1L), mar = c(5.1, 4.1, 4.1, 2.1))
# inner products of exact and discrete landscapes
pl1 %*% pl2
pl1_ %*% pl2_
pl1 %*% pl2_
```
(Note that the landscapes are automatically delimited to a compatible domain.)
### Calculus
The `summary()` method above reported the magnitude and the integral of the persistence landscape `pl1`.
The magnitude is the inner product of `pl1` with itself: `pl1 %*% pl1 = `r pl1 %*% pl1``.
Meanwhile, the integral is the (signed) area under the curve, itself also a linear operator:
```{r}
pl_integrate(pl1)
pl_integrate(pl2)
# 1-integral obeys linearity
pl_integrate(pl1) * 2 - pl_integrate(pl2)
pl_integrate(pl1 * 2 - pl2)
```
The distance between two landscapes is defined in terms of the integral of their absolute difference for finite norms and the maximum pointwise distance for the infinite norm.
Note that, because `pl_integrate()` defaults to the 1-norm and `pl_distance()` defaults to the 2-norm, we must be careful when comparing their results:
```{r}
# using the 1-norm
pl_integrate(pl_abs(pl1 * 2 - pl2), p = 1)
pl_distance(pl1 * 2, pl2, p = 1)
# using the 2-norm
pl_integrate(pl_abs(pl1 * 2 - pl2), p = 2) ^ (1/2)
pl_distance(pl1 * 2, pl2, p = 2)
# using the infinity norm
pl_vmax(pl_abs(pl1 * 2 - pl2))
pl_distance(pl1 * 2, pl2, p = Inf)
```
The norm of a persistence landscape is then defined as its distance from the null landscape that is constant at zero.
```{r}
# null landscape
pd0 <- data.frame(start = double(0L), end = double(0L))
pl0 <- pl_new(pd0, degree = 1, exact = TRUE)
pl_distance(pl1, pl0)
pl_norm(pl1)
```
### Statistical Analysis
Finally, **plt** implements the two hypothesis tests described in the original paper.
To illustrate, we first generate lists of landscapes for samples from two spaces:
```{r}
# samples of landscapes from lemniscates
pl1s <- replicate(6, {
pc <- tdaunif::sample_lemniscate_gerono(60, sd = .1)
pd <- ripserr::vietoris_rips(pc, dim = 1, threshold = 2, p = 2)
pl_new(pd, degree = 1, xby = .01)
})
# samples of landscapes from circles
pl2s <- replicate(8, {
pc <- tdaunif::sample_circle(60, sd = .1) / 2
pd <- ripserr::vietoris_rips(pc, dim = 1, threshold = 2, p = 2)
pl_new(pd, degree = 1, xby = .01)
})
```
An inspection of the mean landscapes makes clear that they are distinct:
```{r}
# average landscape from each sample
par(mfcol = c(2L, 1L), mar = c(2, 2, 0, 2))
plot(pl_mean(pl1s))
plot(pl_mean(pl2s))
par(mfrow = c(1L, 1L), mar = c(5.1, 4.1, 4.1, 2.1))
```
However, the two hypothesis tests use different procedures and rely on different test statistics, so one may be more effective than another.
For convenience, both methods return objects of class `'htest'` for convenient printing:
```{r}
# z-test of difference in integrals of first level
pl_z_test(pl1s, pl2s)
# permutation test of pairwise distances between landscapes
pl_perm_test(pl1s, pl2s)
```
## Acknowledgments
### Precursors
The C++ library is adapted from [Paweł Dłotko's Persistence Landscape Toolbox](https://www2.math.upenn.edu/~dlotko/persistenceLandscape.html).
It was originally adapted and ported to R in [Jose Bouza's **tda-tools** package](https://github.com/jjbouza/tda-tools).
### Resources
Development of this package benefitted from the use of equipment and the support of colleagues at the [University of Florida](https://www.ufl.edu/), especially [Peter Bubenik's research group](https://people.clas.ufl.edu/peterbubenik/researchgroup/) and the [Laboratory for Systems Medicine](https://systemsmedicine.pulmonary.medicine.ufl.edu/).
### Contribute
Bug reports, unit tests, documentation, use cases, feature suggestions, and other contributions are welcome. See the [CONTRIBUTING](https://github.com/corybrunson/plt/blob/main/CONTRIBUTING.md) file for guidance, and please respect the [Code of Conduct](https://github.com/corybrunson/plt/blob/main/CODE_OF_CONDUCT.md).
### Cite
If you use **plt** in published work, please include a citation following `citation("plt")`.