Utilities to adapt the GBIF Backbone Taxonomy for use by LA portals.
This is a set of utilities to convert the GBIF Backbone Taxonomy for use by LA portals. It consists of the following steps:
- Download the current GBIF backbone taxonomy and some dependencies
- Separate scientificName and scientificNameAuthorship from Taxon.tsv
- Generate a subset of the original VernacularName.tsv file based on the language field
- Detect issues during this conversion that can be reported back to GBIF (see target/backbone/issues.tsv)
- Work around some issues in the taxonomy that prevent the nameindexer from working properly
- Generate lucene 6 and lucene 8 indexes for use by LA portals (for nameindexer, namematching-service and sensitive-data-service)
- Generate the modified dwca for bie-index
The resulting indexes are published in updated LA ansible inventories (those generated by the la-toolkit), so in general you can use these indexes without running this repository yourself (which is quite disk and time consuming).
This tool uses Docker, which downloads all the dependencies.
Some usage help:
docker build . --tag gbif-taxonomy-for-la
and with the image ready:
./gbif-taxonomy-for-la-docker --help
Options:
  --backbone                    Download GBIF backbone taxonomy
  --name-authors                Split name and authors from the GBIF backbone
  --prepare-tests               Prepare tests
  --tests                       Run tests
  --filter_lang=<langs>         Filter VernacularName.tsv file for given language [default: ].
  --namematching-distri=<nmv>   Download ALA namematching-distribution version [default: 4.3].
  --namematching-index          Generate namematching index
  --namematching-index-legacy   Generate namematching index legacy (pre namemaching-service)
  --dwca                        Regenerate the dwca zip
  --help                        Show help options.
  --version                     Print program version.
Download the GBIF taxonomy:
./gbif-taxonomy-for-la-docker --backbone 2023-12-18
Optionally, filter common names from VernacularName.tsv by language, passed as a comma-separated list:
./gbif-taxonomy-for-la-docker --backbone --filter_lang=en,sv 2023-12-18
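To give an idea of what a language filter on VernacularName.tsv amounts to, here is a minimal awk sketch. It is not the code this tool runs. It assumes the file is tab-separated and that its first line is a header naming a language column. The function name filter_vernacular_by_lang and the sample rows are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Sketch only: keep the header plus the rows whose `language` column
# matches one of the comma-separated codes given as first argument.
# Assumption: tab-separated input whose header names a `language` column.
filter_vernacular_by_lang() {
  langs="$1"   # e.g. "en,sv"
  input="$2"   # e.g. VernacularName.tsv
  awk -F '\t' -v langs="$langs" '
    BEGIN {
      n = split(langs, wanted, ",")
      for (i = 1; i <= n; i++) keep[wanted[i]] = 1
    }
    NR == 1 {
      # Find the language column by name and always keep the header.
      for (i = 1; i <= NF; i++) if ($i == "language") col = i
      print
      next
    }
    # Print a data row only if its language is one of the wanted codes.
    col && (($col) in keep)
  ' "$input"
}

# Demo on a tiny made-up sample (the taxonID values are illustrative).
tmp=$(mktemp -d)
printf 'taxonID\tvernacularName\tlanguage\n1\tcougar\ten\n1\tpuma\tsv\n1\tpuma\tes\n' > "$tmp/VernacularName.tsv"
filter_vernacular_by_lang "en,sv" "$tmp/VernacularName.tsv"
rm -rf "$tmp"
```

Looking the column up by header name rather than by position keeps the sketch independent of the column order, which is not stated here.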
Full process: download the nameindex, select languages, split scientificName and scientificNameAuthorship, generate the indexes, and recreate the modified backbone dwca, using a date as file suffix:
./gbif-taxonomy-for-la-docker --backbone --filter_lang=en,sv --name-authors --namematching-distri=4.3 --namematching-index --namematching-index-legacy --dwca 2024-02-09-sv
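The name and author split can be pictured with the following awk sketch. It is an illustration, not the transformation this repository actually performs. It assumes a tab-separated Taxon.tsv whose header names the scientificName and scientificNameAuthorship columns, and it assumes the authorship appears as the trailing part of scientificName. The function name split_name_authors and the taxonID values in the sample are hypothetical.

```shell
#!/bin/sh
# Sketch only: remove the authorship from the end of scientificName
# when scientificNameAuthorship is a true suffix of it.
# Assumption: tab-separated input with a header naming both columns.
split_name_authors() {
  awk -F '\t' '
    BEGIN { OFS = "\t" }
    NR == 1 {
      # Locate both columns by header name and keep the header as is.
      for (i = 1; i <= NF; i++) {
        if ($i == "scientificName") name = i
        if ($i == "scientificNameAuthorship") auth = i
      }
      print
      next
    }
    name && auth && $auth != "" {
      n = length($name); a = length($auth)
      # Only strip when the name really ends with the authorship.
      if (n > a && substr($name, n - a + 1) == $auth) {
        stripped = substr($name, 1, n - a)
        sub(/[ ]+$/, "", stripped)   # drop the separating space(s)
        $name = stripped
      }
    }
    { print }
  ' "$1"
}

# Demo on a tiny made-up sample.
tmp=$(mktemp -d)
printf 'taxonID\tscientificName\tscientificNameAuthorship\trank\n1\tPuma concolor (Linnaeus, 1771)\t(Linnaeus, 1771)\tspecies\n2\tPuma\t\tgenus\n' > "$tmp/Taxon.tsv"
split_name_authors "$tmp/Taxon.tsv"
rm -rf "$tmp"
```

Rows with an empty authorship, or where the authorship is not a suffix of the name, pass through unchanged, so the sketch never guesses at a split it cannot verify.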
Just generate the indexes:
./gbif-taxonomy-for-la-docker --namematching-index --namematching-index-legacy 2023-12-18
Prepare and run the tests, with or without downloading the backbone first:
./gbif-taxonomy-for-la-docker --backbone --prepare-tests --tests 20231218
./gbif-taxonomy-for-la-docker --prepare-tests --tests 20231218
If you detect any additional issues, pull requests are welcome.
- Evaluate the use of checklistbank import/export to generate the dwca correctly instead of the nodejs transformation of scientificName/Authorship (using simple.txt.gz to import into postgres).
- Format the GBIF VernacularName.tsv so it can be used by the ala-namematching software (see expected csv format). This requires adding the scientificName, so it would be worth using postgres here too.