
biodiversitydata-se/gbif-taxonomy-for-la

 
 


gbif-taxonomy-for-la

Utilities to adapt the GBIF Backbone Taxonomy for use by Living Atlases (LA) portals.

Introduction

This is a set of utilities to convert the GBIF Backbone Taxonomy for use by LA portals. The conversion consists of the following steps:

  • Download the current GBIF backbone taxonomy and some dependencies.
  • Separate scientificName and scientificNameAuthorship in Taxon.tsv.
  • Generate a subset of the original VernacularName.tsv file based on the language field.
  • Detect issues during this conversion that can be reported back to GBIF (see target/backbone/issues.tsv).
  • Work around some issues in the taxonomy that prevent the nameindexer from working properly.
  • Generate Lucene 6 and Lucene 8 indexes for use by LA portals (for nameindexer, namematching-service and sensitive-data-service).
  • Generate the modified dwca for bie-index.
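
To give a rough idea of the name/authorship separation step listed above, here is a minimal sketch in shell and awk. It is not the transformation this repository actually runs; the function name split_name_authors is hypothetical. It assumes a tab-separated file whose header row contains columns named scientificName and scientificNameAuthorship, and it only strips the authorship when the name really ends with it.

```shell
# Hypothetical helper, not part of this repository.
# Assumes a tab-separated file whose header row has columns named
# scientificName and scientificNameAuthorship.
split_name_authors() {
  awk -F'\t' -v OFS='\t' '
    NR == 1 {
      # Locate both columns by header name instead of hard-coding positions.
      for (i = 1; i <= NF; i++) {
        if ($i == "scientificName") name = i
        if ($i == "scientificNameAuthorship") auth = i
      }
      if (!name || !auth) { print "missing column" > "/dev/stderr"; exit 1 }
      print; next
    }
    {
      n = $name
      a = $auth
      # Strip the authorship only when the name ends with " <authorship>".
      if (a != "" && length(n) > length(a) + 1 &&
          substr(n, length(n) - length(a)) == " " a) {
        $name = substr(n, 1, length(n) - length(a) - 1)
      }
      print
    }
  ' "$1"
}
```

It could be called as split_name_authors Taxon.tsv > Taxon.split.tsv (the output file name is only an example). Rows with an empty authorship, or whose name does not end with the authorship string, pass through unchanged.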

The resulting indexes are published in updated LA ansible inventories (those generated by the la-toolkit), so in general you can use these indexes without running this repository yourself, which takes considerable disk space and time.

Dependencies

This tool uses Docker, which downloads all the dependencies.

Usage

Build the Docker image first:

docker build . --tag gbif-taxonomy-for-la

Then, with the image ready, print the usage help:

./gbif-taxonomy-for-la-docker --help

Options:
      --backbone                    Download GBIF backbone taxonomy
      --name-authors                Split name and authors from the GBIF backbone
      --prepare-tests               Prepare tests
      --tests                       Run tests
      --filter_lang=<langs>         Filter VernacularName.tsv file for given language [default: ].
      --namematching-distri=<nmv>   Download ALA namematching-distribution version [default: 4.3].
      --namematching-index          Generate namematching index
      --namematching-index-legacy   Generate namematching index legacy (pre namemaching-service)
      --dwca                        Regenerate the dwca zip
      --help                        Show help options.
      --version                     Print program version.

Download the GBIF taxonomy:

./gbif-taxonomy-for-la-docker --backbone 2023-12-18

Optionally, filter common names from VernacularName.tsv by language, passed as a comma-separated list:

./gbif-taxonomy-for-la-docker --backbone --filter_lang=en,sv 2023-12-18
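
For readers who want to see what this kind of language filtering amounts to, the following is a minimal sketch, not the code this tool runs. The function name filter_vernacular_by_lang is hypothetical. It assumes a tab-separated file with a header row containing a column named language, and it does an exact, case-sensitive match on the language codes.

```shell
# Hypothetical helper, not part of this repository.
# Keeps the header plus the rows whose language column is in a
# comma-separated list such as "en,sv".
filter_vernacular_by_lang() {
  awk -F'\t' -v langs="$2" '
    BEGIN {
      # Turn "en,sv" into a lookup table: keep["en"], keep["sv"].
      n = split(langs, parts, ",")
      for (i = 1; i <= n; i++) keep[parts[i]] = 1
    }
    NR == 1 {
      # Find the language column by name rather than by position.
      for (i = 1; i <= NF; i++) if ($i == "language") col = i
      if (!col) { print "no language column" > "/dev/stderr"; exit 1 }
      print; next
    }
    $col in keep { print }
  ' "$1"
}
```

For example, filter_vernacular_by_lang VernacularName.tsv en,sv > VernacularName.en-sv.tsv would write the reduced file (the output file name is only an example).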

Full process: download the backbone and the namematching distribution, filter languages, split scientificName and scientificNameAuthorship, generate the indexes, and recreate the modified backbone dwca, using a date as the file suffix:

./gbif-taxonomy-for-la-docker --backbone --filter_lang=en,sv --name-authors --namematching-distri=4.3 --namematching-index --namematching-index-legacy --dwca 2024-02-09-sv 

Just generate the indexes:

./gbif-taxonomy-for-la-docker --namematching-index --namematching-index-legacy 2023-12-18

Tests

./gbif-taxonomy-for-la-docker --backbone --prepare-tests --tests 20231218 

./gbif-taxonomy-for-la-docker --prepare-tests --tests 20231218 

Contributing

If you find additional issues, pull requests are welcome.

TODO

  • Evaluate the use of checklistbank import/export to generate the dwca correctly, instead of the nodejs transformation of scientificName/Authorship (using simple.txt.gz to import into postgres).
  • Format the GBIF VernacularName.tsv so it can be used by the ala-namematching software (see expected csv format). This requires adding the scientificName, so using postgres may be worthwhile here too.
