This repository contains the functionality to process occurrence data and create aggregated occurrence cubes. An occurrence cube is a multi-dimensional array of values. In our context we have three dimensions, `N = 3`:
- taxonomic (taxon)
- temporal (year)
- spatial (cell code)
For each triplet, the stored value represents the number of occurrences found in GBIF.
Because the occurrence cubes are used as input for modelling and risk assessment, we also store, as a second value, the smallest geographic coordinate uncertainty among the occurrences assigned to a given cell code. Each occurrence is first reassigned randomly within its uncertainty circle and then assigned to a cell. If the uncertainty is not available, a default radius of 1000 m is used. Because of this random assignment, repeating the process on the same occurrence data can produce different occurrence cubes.
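The random reassignment described above can be sketched in R as follows. This is a minimal illustration, not the code used in the pipeline: the function name `random_point_in_circle()`, the column names and the choice of drawing points uniformly over the disc are assumptions, and coordinates are assumed to be in a projected coordinate system with metre units.

```r
# Hypothetical helper: move each occurrence to a random point
# inside its uncertainty circle (coordinates in metres).
random_point_in_circle <- function(x, y, uncertainty_m, default_m = 1000) {
  # Missing uncertainty falls back to the default radius
  radius <- ifelse(is.na(uncertainty_m), default_m, uncertainty_m)
  n <- length(x)
  # sqrt() on the radial draw spreads points uniformly over the disc area
  r <- radius * sqrt(runif(n))
  theta <- runif(n, min = 0, max = 2 * pi)
  data.frame(x = x + r * cos(theta), y = y + r * sin(theta))
}

set.seed(42)  # fixing the seed makes a single run reproducible
occ <- data.frame(
  x = c(3872500, 3872500),
  y = c(3101500, 3149500),
  uncertainty_m = c(250, NA)  # second record has no uncertainty
)
moved <- random_point_in_circle(occ$x, occ$y, occ$uncertainty_m)
```

Without a fixed seed, two runs on the same input will generally place occurrences at different points, which is why the resulting cubes can differ.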
Using a tabular structure (typical of R data.frames), a cube would look like this:
taxon | year | cell code | number of occurrences | minimal coordinate uncertainty |
---|---|---|---|---|
2366634 | 2002 | 1kmE3872N3101 | 8 | 250 |
2382155 | 2002 | 1kmE3872N3101 | 3 | 250 |
2498252 | 2002 | 1kmE3872N3149 | 2 | 1000 |
5232437 | 2002 | 1kmE3872N3149 | 4 | 1000 |
where the number of columns is equal to the number of dimensions, `N`, plus the number of values. In our case we have three dimensions and two values, so five columns in total.
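As an illustration of how such a table can be derived, the sketch below aggregates a few made-up occurrence records into a cube with base R. The column names (`cell_code`, `uncertainty_m`, `n`, `min_coord_uncertainty`) and the record values are assumptions chosen for the example and are not taken from the actual pipeline.

```r
# Made-up occurrence records: one row per occurrence
occurrences <- data.frame(
  taxon = c(2366634, 2366634, 2382155, 2498252),
  year = c(2002, 2002, 2002, 2002),
  cell_code = c("1kmE3872N3101", "1kmE3872N3101",
                "1kmE3872N3101", "1kmE3872N3149"),
  uncertainty_m = c(250, 500, 250, 1000),
  stringsAsFactors = FALSE
)

# Value 1: number of occurrences per (taxon, year, cell code)
counts <- aggregate(uncertainty_m ~ taxon + year + cell_code,
                    data = occurrences, FUN = length)
names(counts)[4] <- "n"

# Value 2: smallest coordinate uncertainty per (taxon, year, cell code)
min_unc <- aggregate(uncertainty_m ~ taxon + year + cell_code,
                     data = occurrences, FUN = min)
names(min_unc)[4] <- "min_coord_uncertainty"

# Three dimension columns plus two value columns
cube <- merge(counts, min_unc, by = c("taxon", "year", "cell_code"))
```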
One of the main outputs of the TrIAS project is a global, unified checklist of alien species in Belgium. This checklist is published on GBIF as Global Register of Introduced and Invasive Species - Belgium. More information can be found here. In this repository we produce an occurrence cube of all taxa of the unified checklist that have at least one occurrence in Belgium. As we also need an occurrence cube at class level (baseline), we harvested ALL occurrences in Belgium. This first step is documented in /src/belgium/1_download.Rmd. At the time of writing, the GBIF download contains around 35 million occurrences. To handle such a large amount of data, we work with a SQLite file, see the second step: /src/belgium/2_create_db.Rmd. Handling the geographic uncertainty of the occurrences and assigning the corresponding cell codes is the third step, documented in /src/belgium/3_assign_grid.Rmd. Finally, we select the alien species belonging to the Global Register of Introduced and Invasive Species - Belgium and aggregate the occurrences by year, cell code and taxon, as described in the fourth and last step: /src/belgium/4_aggregate.Rmd. The resulting occurrence cube is saved in the file `/data/processed/be_alientaxa_cube.tsv`. We also produce an occurrence cube at class level, `/data/processed/be_classes_cube.tsv`, which can be used as a baseline to account for research effort bias when calculating occurrence-based indicators.
The TrIAS project aims to assess the risk of invasion by applying distribution modelling and other modelling techniques to a subset of taxa of the unified checklist. The list is saved in the file `/references/modelling_species.tsv`. For these species we build a specific occurrence cube that takes into account all occurrences in Europe. The region we define as Europe is the one described by the European Environment Agency, see image here. Similarly to the Belgian cube, we first download the occurrences of these species following the workflow described in `src/europe/1_download.Rmd`. Then we assign the occurrences randomly within their uncertainty circles in order to calculate the 1 km x 1 km cell they belong to, see `/src/europe/2_assign_grid.Rmd`, and finally we aggregate as described in /src/europe/3_aggregate.Rmd to produce the final occurrence cube at European level, eu_modellingtaxa_cube.tsv, saved in `data/processed`.
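To make the cell assignment more concrete, here is a minimal sketch of how a 1 km cell code could be derived from projected coordinates. The helper `cell_code_1km()` is hypothetical, and the code format (`1km`, then `E` and the easting in km, then `N` and the northing in km) is inferred from the example codes shown in the table above. It assumes the coordinates are already expressed in metres in the grid's projected coordinate system.

```r
# Hypothetical helper: derive the 1 km x 1 km cell code of a point.
# x and y are projected coordinates in metres (assumption).
cell_code_1km <- function(x, y) {
  # The cell is identified by the km coordinates of its lower-left corner
  easting_km <- as.integer(floor(x / 1000))
  northing_km <- as.integer(floor(y / 1000))
  sprintf("1kmE%dN%d", easting_km, northing_km)
}

codes <- cell_code_1km(c(3872400, 3872999), c(3101950, 3149000))
# codes is c("1kmE3872N3101", "1kmE3872N3149")
```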
If a taxon has taxonomic status `ACCEPTED` or `DOUBTFUL`, i.e. it's not a synonym, then GBIF returns not only the occurrences linked directly to it, but also the occurrences linked to its synonyms and its infraspecific taxa.
As an example, consider the species Reynoutria japonica Houtt. If you search for its occurrences worldwide, you will get all the occurrences from its synonyms and infraspecific taxa too.
taxonKey | scientificName | numberOfOccurrences | taxonRank | taxonomicStatus |
---|---|---|---|---|
5652243 | Fallopia japonica f. colorans (Makino) Yonek. | 41 | FORM | SYNONYM |
5652241 | Fallopia japonica var. compacta (Hook.fil.) J.P.Bailey | 52 | VARIETY | SYNONYM |
2889173 | Reynoutria japonica Houtt. | 39576 | SPECIES | ACCEPTED |
4038356 | Reynoutria japonica var. compacta (Hook.fil.) Buchheim | 19 | VARIETY | SYNONYM |
4033014 | Tiniaria japonica (Houtt.) Hedberg | 28 | SPECIES | SYNONYM |
5652236 | Fallopia japonica var. uzenensis (Honda) K.Yonekura & Hiroyoshi Ohashi | 212 | VARIETY | SYNONYM |
5334352 | Polygonum cuspidatum Sieb. & Zucc. | 1570 | SPECIES | SYNONYM |
7291566 | Polygonum japonicum (Houttuyn) S.L.Welsh | 2 | SPECIES | SYNONYM |
5334357 | Fallopia japonica (Houtt.) Ronse Decraene | 110742 | SPECIES | SYNONYM |
7291912 | Reynoutria japonica var. japonica | 2199 | VARIETY | ACCEPTED |
6709291 | Reynoutria compacta (Hook.fil.) Nakai | 1 | SPECIES | SYNONYM |
7413860 | Reynoutria japonica var. terminalis (Honda) Kitag. | 13 | VARIETY | SYNONYM |
8170870 | Reynoutria japonica var. uzenensis Honda | 32 | VARIETY | SYNONYM |
7128523 | Fallopia japonica var. japonica | 1560 | VARIETY | DOUBTFUL |
5651605 | Polygonum compactum Hook.fil. | 28 | SPECIES | SYNONYM |
5334355 | Pleuropterus zuccarinii Small | 1 | SPECIES | SYNONYM |
4038371 | Reynoutria henryi Nakai | 14 | SPECIES | SYNONYM |
8361333 | Fallopia compacta (Hook.fil.) G.H.Loos & P.Keil | 24 | SPECIES | SYNONYM |
7291673 | Polygonum reynoutria (Houtt.) Makino | 3 | SPECIES | SYNONYM |
See https://doi.org/10.15468/dl.rej1cz for more details. Note: the table above is just an example and may be outdated.
By aggregating we would lose this information, so alongside the cubes `be_alientaxa_cube.tsv` and `eu_modellingtaxa_cube.tsv` we provide a kind of taxonomic compendium, `be_alientaxa_info.tsv` and `eu_modellingtaxa_info.tsv` respectively. For each taxon in the cube, they list all the synonyms and infraspecific taxa whose occurrences contribute to the total count. Both are saved in `data/processed`.
For example, Aedes japonicus (Theobald, 1901) is an accepted species present in the Belgian cube. Based on the information stored in `be_alientaxa_info.tsv`, its occurrences include occurrences linked to the following taxa:
- Aedes japonicus (Theobald, 1901)
- Ochlerotatus japonicus (Theobald, 1901)
- Aedes japonicus subsp. japonicus
- BOLD:AAC5210
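A compendium of this kind could be built by collapsing, for each accepted taxon, the distinct names attached to its occurrences. The sketch below uses the names listed above; the column names (`accepted_name`, `occurrence_name`, `included_taxa`) and the separator are assumptions and do not necessarily match the layout of the actual `*_info.tsv` files.

```r
# One row per occurrence: the taxon used in the cube and the name
# the occurrence is linked to (the last record repeats a synonym)
occurrence_taxa <- data.frame(
  accepted_name = rep("Aedes japonicus (Theobald, 1901)", 5),
  occurrence_name = c(
    "Aedes japonicus (Theobald, 1901)",
    "Ochlerotatus japonicus (Theobald, 1901)",
    "Aedes japonicus subsp. japonicus",
    "BOLD:AAC5210",
    "Ochlerotatus japonicus (Theobald, 1901)"
  ),
  stringsAsFactors = FALSE
)

# Collapse the distinct contributing names per accepted taxon
info <- aggregate(
  occurrence_name ~ accepted_name,
  data = occurrence_taxa,
  FUN = function(x) paste(unique(x), collapse = " | ")
)
names(info)[2] <- "included_taxa"
```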
The repository structure is based on Cookiecutter Data Science. Files and directories indicated with `GENERATED` should not be edited manually.
├── README.md : Description of this repository
├── LICENSE : Repository license
├── occ-processing.Rproj : RStudio project file
├── .gitignore : Files and directories to be ignored by git
│
├── references
│   ├── Europe.png : Map of Europe
│   └── modelling_species.tsv : List of species whose occurrences are queried from GBIF at European level
│
├── data
│   ├── raw : Occurrence data as downloaded from GBIF GENERATED
│   ├── interim : Big SQLite and text files, stored locally GENERATED
│   └── processed : Occurrence data cubes and related taxa information GENERATED
│
├── docs : Repository website (not implemented yet) GENERATED
│
└── src
    ├── belgium
    │   ├── 1_download.Rmd : Script to trigger a download of occurrences in Belgium
    │   ├── 2_create_db.Rmd : Script to generate a SQLite file and perform basic filtering
    │   ├── 3_assign_grid.Rmd : Script to assign cell code to occurrences
    │   └── 4_aggregate.Rmd : Script to aggregate data and make the Belgian data cube
    └── europe
        ├── 1_download.Rmd : Script to trigger a download of occurrences in Europe
        ├── 2_assign_grid.Rmd : Script to perform basic filtering and assign cell code to occurrences
        └── 3_aggregate.Rmd : Script to aggregate data and make the modelling data cube at European level
Clone this repository to your computer and open the RStudio project file, `occ-processing.Rproj`.
You can generate the Belgian occurrence data cube by running the R Markdown files in `src/belgium` in the following order:
- `1_download.Rmd`: trigger a GBIF download and add it to the list of triggered downloads
- `2_create_db.Rmd`: create a SQLite database and perform basic data cleaning
- `3_assign_grid.Rmd`: assign geographic cell codes to occurrence data
- `4_aggregate.Rmd`: aggregate occurrences per taxon, year and cell code to produce the Belgian occurrence data cube
In the aggregation step, we also create a data cube at class level. The data cubes are automatically generated in the folder `/data/processed/`.
At European level we are interested in the occurrences of a list of taxa, which will be used for modelling and risk assessment. This list is maintained in the file `modelling_species.tsv` in the folder `references`.
You can generate the European occurrence data cube by running the R Markdown files in `src/europe` in the following order:
- `1_download.Rmd`: trigger a GBIF download and add it to the list of triggered downloads
- `2_assign_grid.Rmd`: assign geographic cell codes to occurrence data
- `3_aggregate.Rmd`: aggregate occurrences per taxon, year and cell code to produce the European occurrence data cube
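If you prefer to run the steps from the R console instead of opening each file, something like the sketch below could be used. The helpers `pipeline_files()` and `run_pipeline()` are hypothetical and not part of this repository; `rmarkdown::render()` is the standard function for executing an R Markdown file and requires the `rmarkdown` package.

```r
# Hypothetical helper: list the R Markdown files of a pipeline
# directory, ordered by the numeric prefix of the file name.
pipeline_files <- function(dir) {
  files <- list.files(dir, pattern = "^[0-9]+_.*\\.Rmd$", full.names = TRUE)
  step <- as.integer(sub("_.*$", "", basename(files)))
  files[order(step)]
}

# Hypothetical helper: render every step of a pipeline in order.
run_pipeline <- function(dir) {
  for (f in pipeline_files(dir)) {
    message("Rendering ", f)
    rmarkdown::render(f)
  }
  invisible(NULL)
}

# Example calls (not executed here):
# run_pipeline("src/belgium")
# run_pipeline("src/europe")
```

Since the first step only triggers a GBIF download, you will probably have to wait until that download is ready before the later steps can use it, so running the steps one at a time may be the safer option.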
Install any required packages.
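One way to find out which packages still need to be installed is sketched below. The helper `missing_packages()` is hypothetical, and the package names in the example call are placeholders: check the `library()` calls in the R Markdown files for the actual list.

```r
# Hypothetical helper: return the packages from `required`
# that are not yet installed on this machine.
missing_packages <- function(required) {
  setdiff(required, rownames(installed.packages()))
}

# Example call with placeholder package names (not executed here):
# to_install <- missing_packages(c("rmarkdown", "rgbif"))
# if (length(to_install) > 0) install.packages(to_install)
```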
MIT License for the code and documentation in this repository.