-
-
Notifications
You must be signed in to change notification settings - Fork 11
/
Copy pathengine.qmd
575 lines (402 loc) · 34.8 KB
/
engine.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
```{r,echo=FALSE,message=FALSE,warning=FALSE}
library(lidR)
library(stars)
r3dDefaults = rgl::r3dDefaults
m = structure(c(0.921, -0.146, 0.362, 0, 0.386, 0.482, -0.787, 0,
-0.06, 0.864, 0.5, 0, 0, 0, 0, 1), .Dim = c(4L, 4L))
r3dDefaults$FOV = 50
r3dDefaults$userMatrix = m
r3dDefaults$zoom = 0.75
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
fig.align = "center")
options(crayon.enabled = TRUE)
rgl::setupKnitr(autoprint = TRUE)
#old_hooks <- fansi::set_knit_hooks(knitr::knit_hooks, which = c("output", "message", "error"))
```
# `LAScatalog` processing engine (1/2) {#sec-engine}
## Rationale for `LAScatalog` functionality {#sec-engine-rationale}
So far we have demonstrated how to process a point cloud file using `readLAS()`. In practice, ALS acquisitions are made up of hundreds or even thousands of files, not being feasible (or possible!) for all to be loaded at once into memory. For example, the image below shows an acquisition split into 320 1 km² tiles:
```{r plot-catalog-hali, echo=FALSE, warning=FALSE}
haliburton <- readRDS("data/ENGINE/haliburton.lascatalog")
st_crs(haliburton) <- 32617
plot(haliburton)
```
The extraction of regions of interest (ROIs) such as plot inventories becomes difficult in these situations because one must search in which file and sometimes which file**s** the ROI belongs. It also makes the creation of continuous outputs such as a DTM, a raster of metrics, individual tree detection, or anything else far more complicated. One may be tempted to loop though individual files (and we have seen many users doing so) like so:
``` r
files <- list.files("folder/")
for (file in files) {
las <- readLAS(file)
output <- do.something(las)
write(output, "path/to/output.ext")
}
```
This however is very bad practice because each file is loaded without a buffer, meaning outputs will be invalid or weak at the edges because of the absence of spatial context.
Let's compute a DTM (see @sec-dtm) as an example, where we use a `for` loop on 4 contiguous files. We can see the obvious invalidity of the DTM between the files, but also at the periphery, even if less obvious.
```{r plot-dtm-artifact, echo=FALSE, fig.height=7.9, fig.width=8, warning=FALSE}
dtms <- stars::read_stars("data/ENGINE/DTMs.grd")
plot(dtms[,,,1], axes = T, breaks = "equal", downsample = 1)
```
For comparison, see the accurate DTM below, that was generated with the `LAScatalog` processing engine. The benefit here is the ability to manage (among a variety of other capabilities) on-the-fly buffering.
```{r plot-dtm-no-artifact, echo=FALSE, fig.height=7.9, fig.width=8, warning = FALSE}
plot(dtms[,,,2], axes = T, breaks = "equal", downsample = 1)
```
## The `LAScatalog` engine {#sec-engine-engine}
The `lidR` package provides a powerful and versatile set of tools to work with collections of files and enables the application of workflow with an automatic management of tile buffering, on disk storage, parallelization of tasks and so on. While the engine is powerful and versatile, it's also a complex tool, so we have dedicated two section to its description.
This section presents the justification for the engine and demonstrates how to use it by applying common `lidR` functions seen in previous section. The next section goes deeper into the engine to demonstrate how developers can leverage the internal functionality of `lidR` to create their own applications.
The `LAScatalog` class and the `LAScatalog` engine are intricately documented in two dedicated vignettes available [here](https://cran.r-project.org/web/packages/lidR/vignettes/lidR-LAScatalog-class.html) and [here](https://cran.r-project.org/web/packages/lidR/vignettes/lidR-LAScatalog-engine.html). The purpose of this book is to propose alternative documentation with more illustrated examples and real use cases.
## Read a collection of files {#sec-engine-lascatalog}
The function `readLAScatalog()` reads the header of all the `LAS` files of a given folder. The header of a `LAS` file contains, among others, the bounding box of the point cloud, meaning that it's possible to know where the file is situated spatially without reading a potentially massive amount of data. `readLAScatalog()` builds a `sf/sfc_POLYGON` out of the bounding boxes of the files and presents them in an easily accessible manner.
``` r
ctg <- readLAScatalog("path/to/folder/")
```
```{r, echo = FALSE}
ctg <- haliburton
```
```{r plot-catalog-hali-2}
ctg
plot(ctg)
```
In this and the following chapter (@sec-engine2) we will use a collection of nine 500 x 500 m files to create examples. The nine files can be downloaded here:
1. [tiles_338000_5238000.laz](https://github.com/r-lidar/lidRbook/raw/master/data/ENGINE/catalog/tiles_338000_5238000_1.laz)
2. [tiles_338000_5238500.laz](https://github.com/r-lidar/lidRbook/raw/master/data/ENGINE/catalog/tiles_338000_5238500_1.laz)
3. [tiles_338000_5239000.laz](https://github.com/r-lidar/lidRbook/raw/master/data/ENGINE/catalog/tiles_338000_5239000_1.laz)
4. [tiles_338500_5238000.laz](https://github.com/r-lidar/lidRbook/raw/master/data/ENGINE/catalog/tiles_338500_5238000_1.laz)
5. [tiles_338500_5238500.laz](https://github.com/r-lidar/lidRbook/raw/master/data/ENGINE/catalog/tiles_338500_5238500_1.laz)
6. [tiles_338500_5239000.laz](https://github.com/r-lidar/lidRbook/raw/master/data/ENGINE/catalog/tiles_338500_5239000_1.laz)
7. [tiles_339000_5238000.laz](https://github.com/r-lidar/lidRbook/raw/master/data/ENGINE/catalog/tiles_339000_5238000_1.laz)
8. [tiles_339000_5238500.laz](https://github.com/r-lidar/lidRbook/raw/master/data/ENGINE/catalog/tiles_339000_5238500_1.laz)
9. [tiles_339000_5239000.laz](https://github.com/r-lidar/lidRbook/raw/master/data/ENGINE/catalog/tiles_339000_5239000_1.laz)
```{r plot-ctg-book-1, fig.height=4, fig.width=4}
ctg <- readLAScatalog("data/ENGINE/catalog/")
ctg
plot(ctg)
```
```{r, echo=FALSE}
opt_progress(ctg) <- FALSE
```
## Validation {#sec-engine-asprs-compliance}
Similar to @sec-io, an important first step in ALS data processing is ensuring that your data is complete and valid. The `las_check()` function performs an inspection of `LAScatalog` objects for file consistency.
```r
las_check(ctg)
#> Checking headers consistency
#> - Checking file version consistency... ✓
#> - Checking scale consistency... ✓
#> - Checking offset consistency... ✓
#> - Checking point type consistency... ✓
#> - Checking VLR consistency... ✓
#> - Checking CRS consistency... ✓
#> [...]
```
For a deep (and longer) inspection of each file use `deep = TRUE`. This will sequentially load each file entirely. It's thus important to be sure you have enough memory to manage this.
## Extract Regions of interest {#sec-engine-clip}
### Extract a single ROI {#sec-engine-clip-single}
Functions using the `clip_*()` moniker are a good starting point for exploring the capabilities of the `LAScataglog` engine. `clip_*()` functions allow for the extraction of a region of interest (ROI) from a point cloud. The following example extracts a circle from a point cloud loaded into memory in a `LAS` object:
``` r
plot <- clip_circle(las, x = 338700, y = 5238650, radius = 15)
```
This can be extended to a `LAScatalog` seamlessly. The engine searches in which file(s) the ROI belongs and extracts corresponding regions from all applicable files. The output is a `LAS` object.
```{r plot-cliped-roi-1, rgl = TRUE, fig.height=4, fig.width=4}
roi <- clip_circle(ctg, x = 338700, y = 5238650, radius = 40)
plot(roi, bg = "white", size = 4)
```
### Multiple extractions {#sec-engine-clip-multiple}
Multiple extractions is also possible and is performed the same way by searching the corresponding files and then querying in each file no matter if the region of interest is situated in one or several files. The output becomes `list` of `LAS` objects.
```{r plot-ctg-plots, fig.height=4, fig.width=4}
x <- c(339348.8, 339141.9, 338579.6, 338520.8, 338110.0, 339385)
y <- c(5239254, 5238717, 5238169, 5239318, 5239247, 5239290)
r <- 40
plot(ctg)
points(x, y)
rois <- clip_circle(ctg, x, y, r)
rois[1:2]
```
## Modification of default behavior {#sec-engine-options}
When processing a `LAScatalog`, no matter with which function, it internally uses what we called the `LAScatalog` processing engine. This is what happened *under the hood* when using `clip_circle()` in above examples. The default behavior of the engine is set in such a way that it returns what is most likely to be expected by the users. However the behavior of the engine can be tuned to optimize processing. Engine options, at the user level, can be modified with `opt_*()` functions.
The goal of this section is to present these options and how they affect the behavior of the engine.
### Multiple extractions on disk {#sec-engine-options-ondisk}
The engine has the ability to write generated results to disk storage rather than keeping everything in memory. This option can be activated with `opt_output_files() <-` that is used to designate the path where files will be written to disk. It expects a templated filename so each written file will be attributed a unique name.
In the following example, several `LAS` files will be written to disk with names like `339348.8_5239254_1.las` (center coordinates from each file) and the function returns a `LAScatalog` object that references all the new files instead of a list of `LAS` object.
```{r plot-extracted-plots, fig.height=4, fig.width=4}
opt_output_files(ctg) <- paste0(tempdir(), "/{XCENTER}_{YCENTER}_{ID}")
rois <- clip_circle(ctg, x, y, r)
rois
plot(rois)
```
We can check the files that were written on disk and see that the names match the template.
```{r}
rois$filename
```
### Multiple extraction with point cloud indexation {#sec-engine-options-index}
Point cloud indexation is a topic covered by [this vignette](https://cran.r-project.org/web/packages/lidR/vignettes/lidR-computation-speed-LAScatalog.html). In short, `LAS` file indexation allows for faster queries when extracting ROIs in files. Under the hood `lidR` relies on `LASlib` to read files and inherits of the capability to leverage the use of [`LAX` files developed by Martin Isenburg](https://www.youtube.com/watch?v=FMcBywhPgdg). While extracting hundreds of plots from hundreds of files may take many seconds, the use of index files can reduce processing to few seconds.
### Summary
We have learned several things about the `LAScatalog` engine in this section:
1. Many functions of the package work the same either with a point cloud (`LAS`) or a collection of files (`LAScatalog`).
2. The behavior of the engine can be modified to write objects to disk instead of loading everything into memory.
3. The engine takes advantage of file indexes.
## Ground classification {#sec-engine-gnd-classification}
### Chunk processing {#sec-engine-options-chunk}
`classify_ground()` (see @sec-gnd) is an important function of `lidR`, which works seamlessly with the `LAScatalog` engine. An ALS acquisition is processed in pieces, referred to here as **chunks**. An acquisition is usually split into chunks where files correspond to a tiling pattern. Using `classify_ground()` within the `LAScatalog` engine will process each file sequentially. Given that we emphasize the importance of processing point clouds including a buffer, the engine loads a buffer on-the-fly around each tile before any processing is conducted.
The most basic implementation is very similar to examples in @sec-gnd. In this case we first specify where to write the outputs using `opt_output_files()`. The files written on disk will be `LAS` files, while the output in R will be a `LAScatalog`. If we don't write to disk, the result of each chunk will be stored into memory, potentially leading to memory issues.
``` r
opt_output_files(ctg) <- paste0(tempdir(), "{*}_classified")
classified_ctg <- classify_ground(ctg, csf())
```
In reality, R won't crash because the function won't allow users to classify a collection of files without providing a path to save outputs. Most functions in `lidR` check user inputs to mitigate issues.
```{r, error = TRUE}
opt_output_files(ctg) <- ""
classified_ctg <- classify_ground(ctg, csf())
```
### Modifying buffers {#sec-engine-options-chunk-buffer}
It is possible to modify the behavior of the engine by modifying the size of the buffer with `opt_chunk_buffer() <-`. The option `chunk = TRUE` of the function `plot()` allows visualization of how an option affects the processing pattern. The red boundaries show the chunks that will be sequentially processed and the green dotted lines show the extended regions used for buffering each chunk.
```{r plot-buffer-pattern, fig.show='hold', fig.height=4, fig.width=4}
opt_chunk_buffer(ctg) <- 20
plot(ctg, chunk = TRUE)
opt_chunk_buffer(ctg) <- 50
plot(ctg, chunk = TRUE)
```
Depending on the point cloud density and the processes applied, it might be necessary to increase the default buffer. Sometimes no buffer is needed at all. No matter the size, the buffered region is only useful to derive a spatial context that extends the actual size of the chunk. After processing, the buffer is removed and never returned in memory or saved in files. It is only loaded temporarily for computation.
### Modify the chunk size {#sec-engine-options-chunk-size}
In the example above we have seen that each chunk is actually a file. This is the default behavior because it corresponds to the physical splitting pattern. The engine can however sequentially process a `LAScatalog` in any arbitrarily sized chunk. This may be useful when each file is too big to be loaded in memory, and reducing the size of each processing region may be suitable. Sometimes increasing chunk size to process more than a single file might be useful as well. This is controlled by `opt_chunk_size() <-`, and again the function `plot()` enables the preview of the processing pattern.
```{r plot-chunk-pattern, fig.show='hold', fig.height=4, fig.width=4}
# Process sequentially tiny 250 x 250 chunks with 10 m buffer
opt_chunk_size(ctg) <- 250
opt_chunk_buffer(ctg) <- 10
plot(ctg, chunk = TRUE)
# Process sequentially bigger 800 x 800 chunks with 40 m buffer
opt_chunk_size(ctg) <- 800
opt_chunk_buffer(ctg) <- 40
plot(ctg, chunk = TRUE)
```
### Parallel processing {#sec-engine-options-parallel}
In `lidR` every iterative process can be computed in parallel on a single machine or on multiple machines accessible remotely. Remote parallelization is a very advanced case not covered in this book for which a [wiki page](https://github.com/r-lidar/lidR/wiki/Make-cheap-High-Performance-Computing-to-process-large-datasets) has been written.
In the following example only local parallelization is covered. Parallelization is driven by the package [`future`](https://github.com/HenrikBengtsson/future). Understanding this package is thus a requirement for being able to parallelize tasks in `lidR`. In the following example we load `future` and register a parallelization plan to enable parallelized classification.
``` r
library(future)
plan(multisession)
opt_chunk_size(ctg) <- 400
opt_chunk_buffer(ctg) <- 40
opt_output_files(ctg) <- paste0(tempdir(), "/{*}_classified")
classified_ctg <- classify_ground(ctg, csf())
```
Parallelization is a capability of the engine and every function is capable of benefiting from parallel processing. Registering a parallelization plan would have worked with `clip_circle()` as well. However doing so does no mean that classification is performed in parallel, but that **several chunks will be processed simultaneously**. This means that its important to pay attention to memory requirements and parallelization is not necessarily relevant in all cases or in all computers.
### Summary
We have learned several things about the engine in this section
4. The engine iteratively processes regions of interest called chunks.
5. Each chunk is loaded with a buffer on-the-fly.
6. Users can change chunk sizes to reduce the amount of memory used if files are too big.
7. The buffer is used to remove ridge artifacts, which are removed once the computation is done, and is not transferred to processing outputs.
8. Iterative processes can be performed in parallel.
9. Parallelization is performed by processing several chunks simultaneously. Users need processing memory to load several chunks at a time.
## Digital Terrain Model {#sec-engine-dtm}
So far we have learned enough about the engine to generate a nice and valid DTM using `rasterize_terrain()` (see @sec-dtm).
### In memory DTM {#sec-engine-dtm-inmemory}
Generating a DTM that encompasses an entire ALS acquisition is as easy as generating a DTM that encompasses a single point cloud file. The following generates a DTM with default parameters (i.e. processing by file with a 30 m buffer). In each chunk `rasterize_terrain()` computes a DTM. The result is a collection of rasters, however one feature of the engine is to merge everything in single, seamless, manipulable object. A raster is usually less memory intensive that a point cloud, so returning everything in memory is feasible if the raster is not too large.
```{r, echo = FALSE}
opt_chunk_size(ctg) <- 0
opt_chunk_buffer(ctg) <- 30
```
```{r}
dtm <- rasterize_terrain(ctg, 2, tin(), pkg = "terra")
```
We can render a shaded DTM like that from @sec-hillshade) to better visualize the output:
```{r plot-shaded-dtm-catalog, fig.height=7.9, fig.width=8, warning= FALSE}
dtm_prod <- terra::terrain(dtm, v = c("slope", "aspect"), unit = "radians")
dtm_hillshade <- terra::shade(slope = dtm_prod$slope, aspect = dtm_prod$aspect)
plot(dtm_hillshade, col = gray(0:50/50), legend = FALSE)
```
### On disk DTM {#sec-engine-dtm-ondisk}
When the DTM is too big for memory storage it can be written on disk. In this case users will have one raster file per chunk. The buffer is used to perform a valid computation but is removed after the computation, leaving no overlap between files. It is possible to build a virtual raster from multiple files to return a lightweight raster that references on disk files.
```{r dtmdisk, warning=FALSE}
opt_output_files(ctg) <- opt_output_files(ctg) <- paste0(tempdir(), "/{*}_dtm")
dtm <- rasterize_terrain(ctg, 1, tin())
dtm
class(dtm)
```
Light rasters can be used as if it were an in memory raster or a single file raster and is thus much more convenient than having hundred of files on disk without any structure to hold them all. This feature is called Virtual Dataset and is part of the Geospatial Data Abstraction Library (GDAL) (see also [gdalbuildvrt](http://gdal.org/gdalbuildvrt.html)).
### Summary
In this section we learned several the following about the engine in this section
10. The engine takes care of returning a single and manipulable object.
## Height normalization {#sec-engine-normalization}
Previous sections detail considerations for using the `LAScatalog` engine to perform point cloud normalization using `normalize_height()` (see @sec-norm. We created a DTM in the previous section, so we use it here to normalize.
```{r normalize}
opt_output_files(ctg) <- paste0(tempdir(), "/{*}_norm")
ctg_norm <- normalize_height(ctg, dtm)
```
A point cloud-based normalization without a raster is also possible:
``` r
opt_output_files(ctg) <- paste0(tempdir(), "/{*}_norm")
ctg_norm <- normalize_height(ctg, tin())
```
## Area Based Approach {#sec-engine-aba}
Area based approach (`pixel_metrics()`) outputs are raster objects like a DTM. Two more options of the engine do exist that are not controllable in every function. When reading a LAS file with `readLAS()` two options `select` and `filter` are offered (see @sec-io)). When processing a collection of files `readLAS()` is called internally and it is not possible to modify these parameters. Instead we can use the options `opt_filter()` and `opt_select()` to propagate these arguments.
``` r
opt_select(ctg) <- "xyzci"
opt_filter(ctg) <- "-keep_first"
```
```{r, echo = FALSE}
opt_output_files(ctg_norm) <- ""
opt_progress(ctg_norm) <- FALSE
```
Used in conjunction with `pixel_metrics()`, these enable computation on a selected set of points (first returns for example), loading only user specified attributes. This is useful mainly to save processing memory but may also have a small favorable impact on computation time.
```{r plot-raster-hmean-catalog, fig.height=5.9, fig.width=6.9}
opt_select(ctg_norm) <- "xyz"
hmean <- pixel_metrics(ctg_norm, ~mean(Z), 10)
opt_select(ctg_norm) <- "xyz"
opt_filter(ctg_norm) <- "-keep_first"
hmean <- pixel_metrics(ctg_norm, ~mean(Z), 10)
plot(hmean, col = height.colors(25))
```
Not all functions respect these two options. For example it does not make sense to load only XYZ when creating a DTM because the classification of points is required to isolate ground points. It is also non-beneficial to load intensity, user data, return number and so on to compute a DTM. The function `rasterize_terrain()` knows which options to choose and does not respect user inputs. This is the case of several other functions in `lidR`. But when its suitable, users can tune these options.
### Summary
We have learned several things about the engine in this section
11. Some options from `readLAS()`can be used to optimize and tune processing
12. Some `lidR` functions already know the best options and do not respect all user inputs.
## Individual Tree Detection {#sec-engine-itd}
At this stage with have all the tools required to find each individual tree using `locate_trees()` in a collection of files. The following example extracts every tree found over the acquisition. The output is returned in memory as a single, continuous object. Each chunk is processed with a buffer ensuring that trees will be properly detected at the edges of the files. A total of 100,000 trees were found in this example.
```{r, echo = FALSE}
opt_chunk_size(ctg_norm) <- 0
opt_chunk_buffer(ctg_norm) <- 30
opt_output_files(ctg_norm) <- ""
```
```{r plot-tree-catalog, fig.height=7, fig.width=7, warning=FALSE}
ttops <- locate_trees(ctg_norm, lmf(4))
ttops
plot(ttops["Z"], cex = 0.001, pch = 19, pal = height.colors, nbreaks = 30)
```
We see some issues with this however. Each tree is expected to be associated to a single `ID` but if we count the number of trees with the `ID = 1` we can see that we have 9 trees fitting that query. Similarly we have 9 trees labelled "2" and so on.
```{r}
sum(ttops$treeID == 1)
plot(ttops["treeID"], cex = 0.3, pch = 19, nbreaks = 50)
```
This error has been voluntarily introduced at this stage to illustrate how the `LAScatalog` engine works. In the engine, each chunk is processed **independently** of the others. They can even be processed on different machines and in a more or less random order. Thus there is no way to know how many trees were found in the first chunk to restart the numbering in the second chunk. This is mainly because the *'first and second'* chunk may not have any meaning when processing in parallel. The numbering is thus restarted to 1 in each chunk. This is why there is 9 trees numbered 1. One in each chunk. When the output is returned in memory this can be easily fixed by reassigning new IDs
``` r
ttops$treeID <- 1:nrow(ttops)
```
It is however more difficult to achieve the same task if each chunk is written in a file, and almost impossible to achieve in the case of individual tree segmentation as we will see later. To solve this `lidR` has two strategies to generate reproducible unique IDs no matter the processing order or the number of trees found. For more details the reader can refer to the documentation for `locate_trees()`. This option can be selected with the parameter `uniqueness`.
```{r, warning=FALSE}
ttops <- locate_trees(ctg_norm, lmf(4), uniqueness = "bitmerge")
plot(ttops["treeID"], cex = 0.01, pch = 19)
```
The attribute `treeID` is now an arbitrary number (here it is negative but not necessarily) computed from the XY coordinates, that is guaranteed to be **1.** unique, and **2.** reproducible. No matter how the collection of files is processed (by files or by small chunks, in parallel or sequentially, with large or narrow buffer, in memory or on disk) the IDs will always be the same for a given tree.
In all the previous examples we have seen that the engine takes care of merging intermediate results into a single usable object. When intermediate results are stored on disk, the final results in R is a single lightweight object such as a `LAScatalog` or a Virtual Data set. This is not true for `locate_trees()` and other functions that return spatial vectors. The default behaviour is to write Geopackage (.gpkg) and there is no way to build a light weight object. What is returned is thus a vector of written files.
```{r, warning = FALSE}
opt_output_files(ctg_norm) <- paste0(tempdir(), "/{*}")
locate_trees(ctg_norm, lmf(4), uniqueness = "bitmerge")
```
### Summary
We have learned several things about the engine in this section
13. Each chunk is processed strictly independently of the others.
14. When the intermediate outputs are LAS files, the final output is a lightweight `LASctalog`. When the intermediate outputs are raster files, the final output is a lightweight Virtual Data set. However for spatial vectors and some other data types it is not always possible to aggregate files into a single lightweight object. Thus the names of the files are returned.
15. Strict continuity of the output is not always trivial because each chunk is processed independently of the others but `lidR` always guarantee the validity and the continuity of the outputs.
## Individual Tree Segmentation {#sec-engine-its}
Individual tree segmentation with `segment_trees()` is the most complicated function to use with a `LAScatalog` because there are many different algorithms. Some imply the use of an in memory raster and an in memory vector, some require only an in memory raster and some require only a point cloud (see @sec-itd-its). Moreover it is complex to manage edge artifacts. Imagine there is a tree that is situated exactly on the edge between two files. The first half on one file, the second half on another. The two files will be processed independently - maybe on different remote computers. Both sides of the tree must be attributed the same ID independently to get something that is strictly continuous. The function `segment_trees()` manages all these points. The example below uses the `dalponte2016()` algorithm.
First we create a 1 m resolution CHM stored on disk and returned as a light virtual raster.
```{r plot-chm-catalog, fig.height=5.9, fig.width=6.9, warning=FALSE}
opt_output_files(ctg_norm) <- paste0(tempdir(), "/chm_{*}")
chm <- rasterize_canopy(ctg_norm, 1, p2r(0.15))
plot(chm, col = height.colors(50))
```
Second we compute the seeds by finding the trees as seen above. The result **must be** loaded in memory because there is no way to combine many vectors stored on disk like rasters. In this example it is possible because there are 100,000 trees. But for bigger collections it may not be possible to apply this algorithm in a simple way.
``` r
opt_output_files(ctg_norm) <- ""
ttops <- locate_trees(ctg_norm, lmf(4), uniqueness = "bitmerge")
```
To finish, we apply `segment_trees()`. Here we don't need to specify any strategy to get unique IDs because the seeds are already uniquely labelled. These IDs will be preserved. But for an algorithm without any seed such as `watershed()` it is important to have a strategy.
```{r, warning=FALSE}
opt_output_files(ctg_norm) <- paste0(tempdir(), "/{*}_segmented")
algo <- dalponte2016(chm, ttops)
ctg_segmented <- segment_trees(ctg_norm, algo)
```
The new `LAScatalog` that is returned is made up of files that have an extra attribute named `treeID` where each point is labelled with an ID thats unique for each tree, even those that belong between two or more files that were processed independently. We can load a plot between two files to check:
```{r plot-las-tree-segmented-cliped, rgl = TRUE}
opt_output_files(ctg_segmented) <- ""
lasplot <- clip_circle(ctg_segmented, 338500, 5238600, 40)
pol = crown_metrics(lasplot, NULL, geom = "convex")
plot(sf::st_geometry(pol), col = pastel.colors(250), axes = T)
plot(ctg, add = T)
```
We can observe that the segmentation is perfect with respect to the labelling problem. Trees that were segmented twice independently on each edge were attributed the same IDs on both sides and that the final output is wall-to-wall.
## Retile a catalog {#sec-engine-retile}
The function `catalog_retile()` allows for retiling an ALS acquisition of files. This is the best example to combine everything we have seen in this section.
### Retile a catalog into smaller files
`catalog_retile()` allows for retiling the acquisition of files in tiles of any size.
```{r plot-retiled-small, fig.height=4, fig.width=4}
opt_output_files(ctg) <- paste0(tempdir(), "/{XLEFT}_{YBOTTOM}") # label outputs based on coordinates
opt_chunk_buffer(ctg) <- 0
opt_chunk_size(ctg) <- 250 # retile to 250 m
small <- catalog_retile(ctg) # apply retile
plot(small) # some plotting
```
### Add a buffer around each file
`catalog_retile()` is a special case where the buffer **is not** removed after computation. The function does not compute anything, but only streams point from inputs files to output files. It can be used to add a buffer around each file.
```{r plot-retiled-buffer, fig.height=4, fig.width=4}
opt_chunk_buffer(ctg) <- 20 # set buffer to 20
opt_chunk_size(ctg) <- 0 # no change to chunk size
opt_output_files(ctg) <- paste0(tempdir(), "/{ORIGINALFILENAME}_buffered") # name outputs with original name and "_buffered"
buffered <- catalog_retile(ctg) # apply buffer
plot(buffered) # some plotting
```
Be careful. This may be useful for processing in other softwares, but `lidR` loads a buffers on-the-fly and does not support already buffered files. When processing a buffered collection in `lidR` the output is likely to be incorrect with the default parameters.
### Create a new collection with only first returns
In combination with `opt_filter()`, `catalog_retile()` can be used to generate a new collection of files that contains only first returns. This is not very useful in `lidR` because this can be achieved on-the-fly when reading the files using `filter()` (see @sec-filter), though could be useful to process point clouds in other software.
``` r
opt_chunk_buffer(ctg) <- 0
opt_chunk_size(ctg) <- 0
opt_output_files(ctg) <- paste0(tempdir(), "/{ORIGINALFILENAME}_first")
opt_filter(ctg) <- "-keep_first"
first <- catalog_retile(ctg)
```
### Create a new collection of small and buffered ground returns in parallel
We can combine all the options seen in this section to generate a buffered set of tiny files containing only ground returns. We present that here in parallel.
```{r plot-retiled-all, fig.height=4, fig.width=4, message=FALSE}
library(future)
plan(multisession)
opt_output_files(ctg) <- paste0(tempdir(), "/{XLEFT}_{YBOTTOM}_first_buffered")
opt_chunk_buffer(ctg) <- 10
opt_chunk_size(ctg) <- 250
opt_filter(ctg) <- "-keep_class 2"
newctg <- catalog_retile(ctg)
plot(newctg)
```
## The case of ground inventories {#sec-engine-independent-files}
The case of ground inventories or more generally independent files is particular. We generated such collections in the first example of this section.
```{r plot-catalog-inventory, echo=FALSE, fig.height=4, fig.width=4}
plot(rois)
```
`rois` is made of unrelated circular files. Loading a buffer is meaningless because there is no neighbouring files. Creating chunks is meaningless as well because we usually want one output per file without generating a wall-to-wall object. Worse even, with the wrong options the output might be incorrect.
For example on the top right of the previous collection we can see two overlapping files. If processed by chunk and with a buffer this will load the same points twice in the overlapping area, plus some extra points from the other plot that are not related at all to the first plot. In short, in the case of ground plot inventories 99% of the cases consist in processing iteratively by file, without a buffer and without returning a single aggregated object. This case actually corresponds to a regular `for` loop, as shown in the introduction of this section. Thus there is no need for all the features provided by the `LAScatalog` engine and anyone with any background in programming can write a small script that does this job.
The engine can however do it with appropriate options. In this case the `LAScatalog` processing engine becomes more or less a parallelizable `for` loop with a real time monitoring.
```{r plot-dtm-inventory, fig.height=4, fig.width=4,warning=FALSE, eval = FALSE}
opt_chunk_size(rois) <- 0 # processing by files
opt_chunk_buffer(rois) <- 0 # no buffer
opt_wall_to_wall(rois) <- FALSE # disable internal checks to ensure a valid output. Free wheel mode
opt_merge(rois) <- FALSE
opt_output_files(rois) <- ""
dtm <- rasterize_terrain(rois, 1, tin())
plot(dtm[[1]], col = gray(1:50/50))
```
This can be set in one command:
```{r, fig.height=4, fig.width=4}
opt_independent_files(rois) <- TRUE
```
Everything seen so far remains true. But with these options we are sure to not make mistakes when processing independent files.
## Summary of available options {#sec-engine-summary-sheet}
The engine processes the covered area by chunk. A chunk is an arbitrary square region queried from the collection of files. Each chunk is loaded with a buffer, and each chunk correspond to an independent results. All together the chunks tessellate the coverage unless files are not overlapping. Results are merged into a single object at the end of the processing when possible.
- `opt_chunk_size(ctg) <- 800` controls the size of the chunks. Set 0 use the tiling pattern (processing by file)
- `opt_chunk_buffer(ctg) <- 40` controls the size of the buffer around each chunks.
- `opt_chunk_alignment(ctg) <- c(100, 200)` controls the alignment of the chunks
- `opt_output_file(ctg) <- "{TEMPLATE}` redirects the results in files named after the templated name.
- `opt_select(ctg) <- "xyzi"` loads only the attributes of interest when querying a chunk, in this case the xyz and intensity values (see `readLAS()` - @sec-read))
- `opt_filter(ctg) <- "-keep_first"` loads only the points of interest when querying a chunk (see `readLAS()` - @sec-read)
- `opt_wall_to_wall(ctg) <- FALSE` disables some internal control that guarantee that output will be strictly wall-to-wall. It should not be used actually because `opt_independent_files()` should cover the vast majority of cases
- `opt_independent_files() <- TRUE` to configure the processing for the special case of independent files (typically plot inventories). It turns the engine into a simple loop and the notion of chunks covering the area is lost.
- `opt_merge(ctg) <- FALSE` disables the final merging. In this case a `list` is returned in all cases.
- `opt_progress(ctg) <- FALSE` disables progress estimation ticker.
- `opt_stop_early(ctg) <- FALSE` disable early exit. If a chunk throws an error the processing keeps going and this chunk will be missing in the output.
- Use `future` and register a parallelization plan to process several chunks at a time taking advantage of multiple cores or multiple machines architectures.
One can use the function `summary()` to display the main current processing options of the catalog:
```{r}
summary(ctg)
```