Standard / DCAT (and profiles) export #7600

Open
wants to merge 96 commits into
base: main

fxprunayre
Member

@fxprunayre fxprunayre commented Jan 9, 2024

Following the recent publication of several DCAT profiles (at least in Europe), GeoNetwork needs to improve its capacity to export metadata records in DCAT. The GeoNetwork DCAT export, first implemented in 2012, targeted interaction with semantic services and semantic sitemap support. Changes related to GeoDCAT-AP were added later to improve the mapping, but the mapping was not fully consistent with the SEMICeu work (eg. the ISO19139 to GeoDCAT-AP XSL conversion). New DCAT profiles are now defined, and some datasets and services managed in catalogues fall within the scope of those profiles (eg. HVD, mobility).

DCAT mappings and formatters

This proposal adds:

The mapping is done from ISO19115-3 to DCAT*. An ISO19139 to ISO19115-3 conversion can be applied beforehand if needed.

The SEMICeu XSLT conversion is also included, with minor improvements (https://github.com/SEMICeu/iso-19139-to-dcat-ap/pulls). This conversion goes from ISO19139 to RDF; if needed, a conversion from ISO19115-3 is applied first.

The mapping was created with:

  • BRGM and Wallonia Region (SPW)
  • Existing GeoNetwork DCAT mapping
  • SEMICeu ISO to GeoDCAT-AP conversion

Each DCAT format is available through a formatter, eg. http://localhost:8080/geonetwork/srv/api/records/be44fe5a-65ca-4b70-9d29-ac5bf1f0ebc5/formatters/eu-dcat-ap
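
As an illustration, the formatter URL for a record can be assembled as shown here. This is a minimal Python sketch and not part of the PR: the helper name `formatter_url` is hypothetical, and the formatter identifiers are the ones that appear in the download configuration of this description.

```python
from urllib.parse import quote

# Formatter identifiers taken from the download list in this PR description.
DCAT_FORMATTERS = (
    "dcat",
    "eu-dcat-ap",
    "eu-geodcat-ap",
    "eu-dcat-ap-mobility",
    "eu-dcat-ap-hvd",
)


def formatter_url(base_url: str, uuid: str, formatter: str) -> str:
    """Build the record formatter URL for one DCAT flavour (hypothetical helper)."""
    if formatter not in DCAT_FORMATTERS:
        raise ValueError(f"unknown DCAT formatter: {formatter}")
    # Percent-encode the UUID so that unusual identifiers stay a single path segment.
    record = quote(uuid, safe="")
    return f"{base_url.rstrip('/')}/srv/api/records/{record}/formatters/{formatter}"


url = formatter_url(
    "http://localhost:8080/geonetwork",
    "be44fe5a-65ca-4b70-9d29-ac5bf1f0ebc5",
    "eu-dcat-ap",
)
```

The allow-list only guards against typos in the formatter name; a catalogue may of course expose other formatters.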

Validation

Validation in test:

  • check that RDF graph is valid
  • check using SHACL rules (disabled)

Online validation tool:

Opendata portal testing

Successfully tested with the following data portals:

  • https://data.gov.be/
  • https://www.data.gouv.fr/ (with some limitations: udata does not support temporal ranges with no end date (now in ISO), multiple resource identifiers or EU vocabularies for licenses, and series are not imported)

Mapping discussion

Embedded objects vs. references

ISO records rarely contain references to objects, but references can be encoded in several ways:

  • using Anchor eg. keywords
 <gmd:keyword>
    <gmx:Anchor xlink:href="http://rod.eionet.europa.eu/obligations/102">Greenhouse gas inventories (UNFCCC)</gmx:Anchor>
 </gmd:keyword>
  • using @uuid (sometimes used in contact)
<gmd:contact>
  <gmd:CI_ResponsibleParty uuid="https://edmo.seadatanet.org/report/4667">
  • using a specific field eg. partyIdentifier, metadataLinkage
<cit:partyIdentifier>
  <mcc:MD_Identifier>
     <mcc:code>
        <gco:CharacterString>https://orcid.org/0000-0002-5150-2276</gco:CharacterString>
     </mcc:code>
     <mcc:codeSpace>
        <gco:CharacterString>ORCID</gco:CharacterString>
<mdb:metadataLinkage>
  <cit:CI_OnlineResource>
    <cit:linkage>
      <gco:CharacterString>https://metawal.wallonie.be/geonetwork/srv/api/records/3dc91d81-338f-425d-b27b-f62aa0464e21</gco:CharacterString>
    </cit:linkage>
    <cit:function>
      <cit:CI_OnLineFunctionCode codeList="http://standards.iso.org/iso/19115/resources/Codelists/cat/codelists.xml#CI_OnLineFunctionCode" codeListValue="completeMetadata"/>

The DCAT mapping provides an entry point to customize where to pick up object references.
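
To make the first two encodings concrete, here is a small Python sketch (not the XSLT used in this PR) that picks up a reference URI from an Anchor or from a @uuid attribute. The function name `object_reference`, the rule that a @uuid only counts when it is an http(s) URI, and the trimmed-down sample record are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# ISO19139 namespaces used by the Anchor and @uuid snippets.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gmx": "http://www.isotc211.org/2005/gmx",
}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def object_reference(element):
    """Return the URI referencing the object, or None when it is only embedded."""
    # Case 1: gmx:Anchor carrying an xlink:href (eg. keywords).
    anchor = element.find("gmx:Anchor", NS)
    if anchor is not None and anchor.get(XLINK_HREF):
        return anchor.get(XLINK_HREF)
    # Case 2: @uuid holding a URI (sometimes used in contacts).
    party = element.find("gmd:CI_ResponsibleParty", NS)
    if party is not None:
        uuid = party.get("uuid", "")
        if uuid.startswith(("http://", "https://")):
            return uuid
    return None


# Trimmed-down sample for the sketch, not a schema-valid record.
SAMPLE = """
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gmx="http://www.isotc211.org/2005/gmx"
                 xmlns:gco="http://www.isotc211.org/2005/gco"
                 xmlns:xlink="http://www.w3.org/1999/xlink">
  <gmd:keyword>
    <gmx:Anchor xlink:href="http://rod.eionet.europa.eu/obligations/102">Greenhouse gas inventories (UNFCCC)</gmx:Anchor>
  </gmd:keyword>
  <gmd:keyword>
    <gco:CharacterString>free text keyword</gco:CharacterString>
  </gmd:keyword>
  <gmd:contact>
    <gmd:CI_ResponsibleParty uuid="https://edmo.seadatanet.org/report/4667"/>
  </gmd:contact>
</gmd:MD_Metadata>
"""

root = ET.fromstring(SAMPLE)
references = [object_reference(child) for child in root]
```

Elements without a usable reference return None, which is where a conversion would fall back to an embedded (blank node) object.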

DCAT in CSW service

All DCAT profiles are also accessible through the CSW protocol.
A GetRecordById operation can be used, eg. http://localhost:8080/geonetwork/srv/eng/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetRecordById&ID=da165110-88fd-11da-a88f-000d939bc5d8&outputSchema=https://semiceu.github.io/DCAT-AP/releases/2.2.0-hvd/ which is equivalent to the API call http://localhost:8080/geonetwork/srv/api/records/da165110-88fd-11da-a88f-000d939bc5d8/formatters/eu-dcat-ap-hvd . To support this, the outputSchema configuration was improved so that it is no longer mixed with typenames.
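
When building such a request programmatically, the outputSchema value should be percent-encoded: a schema URI containing `#` would otherwise be cut at the fragment by the HTTP client. The following Python sketch shows one way to do it; the helper name `csw_get_record_by_id_url` is hypothetical, while the parameter names are the ones used in the URL above.

```python
from urllib.parse import urlencode


def csw_get_record_by_id_url(base_url, uuid, output_schema, lang="eng"):
    """Build a CSW 2.0.2 GetRecordById URL with an encoded outputSchema."""
    params = {
        "SERVICE": "CSW",
        "VERSION": "2.0.2",
        "REQUEST": "GetRecordById",
        "ID": uuid,
        # urlencode percent-encodes ':', '/' and '#' in the schema URI.
        "outputSchema": output_schema,
    }
    return f"{base_url.rstrip('/')}/srv/{lang}/csw?{urlencode(params)}"


hvd_url = csw_get_record_by_id_url(
    "http://localhost:8080/geonetwork",
    "da165110-88fd-11da-a88f-000d939bc5d8",
    "https://semiceu.github.io/DCAT-AP/releases/2.2.0-hvd/",
)
```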

If an outputSchema does not provide brief, summary and full variations, a single XSLT can be provided, which makes it easier to bridge the CSW output to a formatter.

DCAT download from the record view

To add the formatter in the record view download list:

{
  "mods": {
    "search": {
      "downloadFormatter": [
        {
          "label": "exportMEF",
          "url": "/formatters/zip?withRelated=false",
          "class": "fa-file-zip-o"
        },
        {
          "label": "exportPDF",
          "url": "/formatters/xsl-view?output=pdf&language=${lang}",
          "class": "fa-file-pdf-o"
        },
        {
          "label": "exportXML",
          "url": "/formatters/xml",
          "class": "fa-file-code-o"
        },
        {
          "label": "DCAT",
          "url": "/formatters/dcat"
        },
        {
          "label": "EU-DCAT-AP",
          "url": "/formatters/eu-dcat-ap"
        },
        {
          "label": "EU-GEO-DCAT-AP",
          "url": "/formatters/eu-geodcat-ap"
        },
        {
          "label": "EU-DCAT-AP-MOBILITY",
          "url": "/formatters/eu-dcat-ap-mobility"
        },
        {
          "label": "EU-DCAT-AP-HVD",
          "url": "/formatters/eu-dcat-ap-hvd"
        }
      ]
    }
  }
}

Related item

Future work & known limitations

  • Improve the URI scheme for each type of resource (eg. organisations, contacts, distributions, keywords)
  • Mapping improvements eg.
    • updates following the ongoing GeoDCAT-APv3 revision
    • property currently not supported: distribution > spdx:checksum
    • improve the mapping of associated resources (using the API?), eg. relations between services and datasets
  • Add those formats to OGC API Records in order to have the same mapping everywhere.
  • Add schematron validation to ensure that users editing ISO records populate all the fields needed for each DCAT profile

Supported by

Funded by BRGM
Funded by Wallonia region (SPW)

Checklist

  • I have read the contribution guidelines
  • Pull request provided for main branch, backports managed with label
  • Good housekeeping of code, cleaning up comments, tests, and documentation
  • Clean commit history broken into understandable chunks, avoiding big commits with hundreds of files, cautious of reformatting and whitespace changes
  • Clean commit messages, longer verbose messages are encouraged
  • API Changes are identified in commit messages
  • Testing provided for features or enhancements using automatic tests
  • User documentation provided for new features or enhancements in manual
  • Build documentation provided for development instructions in README.md files
  • Library management using pom.xml dependency management. Update build documentation with intended library use and library tutorials or documentation

fxprunayre and others added 30 commits December 6, 2023 11:29
Proposal to better organize various flavor of DCAT. The target is to support:
* DCAT
* EU DCAT-AP
* EU DCAT-AP mobility
* EU GeoDCAT-AP.
…ic is required even if isPrimaryTopic is defined. Only one rights allowed.
@josegar74 josegar74 self-requested a review June 3, 2024 08:43
@fxprunayre fxprunayre modified the milestones: 4.4.5, 4.4.6 Jun 13, 2024
Range: dcterms:Frequency (A rate at which something recurs)
Usage note: The value of dcterms:accrualPeriodicity gives the rate at which the dataset-as-a-whole is updated. This may be complemented by dcat:temporalResolution to give the time between collected data points in a time series.
-->
<xsl:variable name="isoFrequencyToDublinCore"
Contributor

I am under the impression that this mapping is not applied. In my experiments, annually is kept as-is instead of being replaced by ANNUAL in the DCAT output (default DCAT outputSchema=http://www.w3.org/ns/dcat#).

Contributor

The previous remark applies to the csw-dcat output.

It does work, however, when formatting the record through the DCAT download formatter /formatters/dcat (download button).

Member Author

http://www.w3.org/ns/dcat# corresponds to the current conversion (see dcat-brief.xsl)

Use http://www.w3.org/ns/dcat#core when using CSW
eg.
http://localhost:8080/geonetwork/srv/fre/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetRecordById&ID=da165110-88fd-11da-a88f-000d939bc5d8&outputschema=http%3A%2F%2Fwww.w3.org%2Fns%2Fdcat%23core

Maybe the current conversion should be renamed to http://www.w3.org/ns/dcat#legacy, but that would affect existing applications using the current DCAT mapping.

@jeanpommier
Contributor

About the licence section, can you provide a licence definition (XML snippet for instance) that would be properly recognized by data.gouv.fr ?

@fxprunayre
Member Author

About the licence section, can you provide a licence definition (XML snippet for instance) that would be properly recognized by data.gouv.fr ?

eg.
https://apps.titellus.net/geonetwork/srv/eng/catalog.search#/metadata/fa6e13d4-5700-4391-a472-dd29b82ba3dbREMITEST
which corresponds to
https://demo.data.gouv.fr/fr/datasets/lieux-dobservation-et-de-surveillance-du-reseau-remitest/

@josegar74
Member

Some vocabularies, like data-themes, HVD and licenses, are provided in the formatters folder as they seem to be used in the XSLT. HVD seems to require keywords from that vocabulary in the metadata records. Licenses also seem to be required in the use constraints elements of the metadata records.

The data-themes do not seem to be associated with that vocabulary in the metadata (https://github.com/geonetwork/core-geonetwork/pull/7600/files#diff-e032efa415655cf3ab6df515d6853952475f431571cbbaf1e2005257761accf2R216-R244), which looks odd, but I don't have all the context.

Can these vocabularies be downloaded from the INSPIRE Registry? Otherwise, the formatters folder doesn't seem a good location for them if users need to load them in GeoNetwork.

@fxprunayre
Member Author

Licenses also seem to be required in the use constraints elements of the metadata records.

For licenses, SHACL validation may expect an EU codelist value, eg. http://publications.europa.eu/resource/authority/licence/CC0

So we need to map more general values like "http://creativecommons.org/publicdomain/zero/1.0/" or "https://creativecommons.org/publicdomain/zero/1.0/", or labels such as "Creative Commons Atribución 4.0 Internacional", to the EU value.
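
A rough Python sketch of that normalisation, limited to the CC0 pair quoted above, could look like this. The function name `to_eu_licence` and the lookup table are hypothetical; the actual conversion in this PR is done in XSLT, and further entries (including label-based ones) would need to be added per catalogue.

```python
EU_LICENCE_CC0 = "http://publications.europa.eu/resource/authority/licence/CC0"

# Keys are stored without scheme and trailing slash so that the http and
# https variants of the same licence URL resolve to one entry.
LICENCE_TO_EU = {
    "creativecommons.org/publicdomain/zero/1.0": EU_LICENCE_CC0,
}


def to_eu_licence(value: str) -> str:
    """Map a general licence URL to the EU codelist value when one is known."""
    key = value.strip()
    for scheme in ("http://", "https://"):
        if key.startswith(scheme):
            key = key[len(scheme):]
            break
    key = key.rstrip("/")
    # Unknown values are returned unchanged rather than guessed.
    return LICENCE_TO_EU.get(key, value)
```

Returning unknown values unchanged keeps the output honest: SHACL validation will flag them instead of silently receiving a wrong codelist entry.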

The data-themes do not seem to be associated with that vocabulary in the metadata

The SEMICeu conversion provides a mapping from topic category or INSPIRE theme to data theme, so the data theme does not need to be encoded in the record.

Can these vocabularies be downloaded from the INSPIRE Registry? Otherwise, the formatters folder doesn't seem a good location for them if users need to load them in GeoNetwork.

It depends, eg. HVD defines vocabularies
https://semiceu.github.io/DCAT-AP/releases/2.2.0-hvd/#controlled-vocabularies-to-be-used
and some of them are available in
https://op.europa.eu/en/web/eu-vocabularies/dataset/-/resource?uri=http://publications.europa.eu/resource/dataset/high-value-dataset-category
so you have to import them from there if you want to encode HVD.

Also, for applicable legislation, the vocabulary does not exist yet (http://data.europa.eu/r5r/applicableLegislation), so a specific thesaurus was created.

This type of issue is indeed hard to track, because you first need to set up a proper template with the required vocabularies for the DCAT flavor you want to produce. Not sure how we could improve that. By default, it would require providing templates + vocabularies.

<!-- Resource
Unsupported:
* dcat:first|previous(sameAs replaces, previousVersion?)|next|last|hasVersion (using the Associated API, navigate to series and sort by date?)
* dct:isReferencedBy (using the Associated API)
Member

What does "(using the Associated API)" mean?

Member Author

Some of the relations to other records are not stored in the record itself, and they also depend on the privileges set on the related records. So another approach could be to use the associated API, which resolves relations in both directions and filters records based on privileges.

Comment on lines 42 to 43
<entry key="dcat:keyword">mdb:MD_Metadata/mdb:identificationInfo/mri:MD_DataIdentification/mri:descriptiveKeywords/mri:MD_Keywords/mri:keyword</entry>
<entry key="dcat:keyword">mdb:MD_Metadata/mdb:identificationInfo/srv:SV_ServiceIdentification/mri:descriptiveKeywords/mri:MD_Keywords/mri:keyword</entry>
Member Author

Removed. Indeed, there are 2 more specialized templates handling keywords with CharacterString and Anchor, so it should never be used.

Comment on lines +38 to +59
<xsl:template mode="iso19115-3-to-dcat"
              match="mdb:identificationInfo/*/mri:descriptiveKeywords/*/mri:keyword[gcx:Anchor/@xlink:href != '']"
              priority="2">
  <dcat:theme>
    <skos:Concept>
      <xsl:call-template name="rdf-object-ref-attribute"/>
      <xsl:call-template name="rdf-localised">
        <xsl:with-param name="nodeName"
                        select="'skos:prefLabel'"/>
      </xsl:call-template>
    </skos:Concept>
  </dcat:theme>
</xsl:template>

<xsl:template mode="iso19115-3-to-dcat"
              match="mdb:identificationInfo/*/mri:descriptiveKeywords/*/mri:keyword[gco:CharacterString/text() != '']"
              priority="2">
  <xsl:call-template name="rdf-localised">
    <xsl:with-param name="nodeName"
                    select="'dcat:keyword'"/>
  </xsl:call-template>
</xsl:template>
Member

dcat:theme should be used when the keyword comes from a thesaurus, but here the template only checks for gcx:Anchor.

dcat:keyword should be used for free-text keywords, which are assumed to use gco:CharacterString, but keywords from a thesaurus can also use gco:CharacterString, no?

Shouldn't the check be whether the keyword has a thesaurusName element?

Member Author

It depends a bit on what applications do with theme and keyword. The main restriction from a DCAT-AP point of view is https://github.com/SPW-DIG/metawal-core-geonetwork/blob/dcat/services/src/test/resources/org/fao/geonet/api/records/formatters/shacl/eu-dcat-ap-3.0.0/mdr-vocabularies.shape.ttl#L121-L130

In practice, it looks like sometimes only the http://publications.europa.eu/resource/authority/data-theme vocabulary is used to set dcat:theme and all other keywords are dcat:keyword. Here the conversion should be SHACL valid as long as you have a topic or INSPIRE theme mapped to data-theme. If there is a thesaurus, we could indeed produce a dcat:theme with a skos:Concept instead of a simple keyword.

See https://github.com/SPW-DIG/metawal-core-geonetwork/blob/dcat/services/src/test/resources/org/fao/geonet/api/records/formatters/iso19115-3.2018-eu-dcat-ap-dataset-core.rdf#L245-L380

Most of the time we configure the editor to encode ISO keywords using Anchor (as requested for INSPIRE). This is facilitated by #8118
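
To summarise the rule discussed in this thread, here is a Python sketch of the decision (the real logic lives in the two XSLT templates under review). The `Keyword` class, the `dcat_property` function and the `use_thesaurus` switch are hypothetical; the switch models the reviewer's suggestion and is not something the PR implements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Keyword:
    label: str
    href: Optional[str] = None       # gcx:Anchor/@xlink:href, if any
    thesaurus: Optional[str] = None  # thesaurus title, if any (hypothetical field)


def dcat_property(keyword: Keyword, use_thesaurus: bool = False) -> Optional[str]:
    """Return the DCAT property a keyword would be mapped to, or None to skip it."""
    # Current behaviour: a keyword encoded with a non-empty Anchor href becomes a theme.
    if keyword.href:
        return "dcat:theme"
    # Suggested alternative: also treat thesaurus keywords as themes.
    if use_thesaurus and keyword.thesaurus:
        return "dcat:theme"
    # Non-empty free text becomes a plain keyword.
    return "dcat:keyword" if keyword.label.strip() else None
```

With the default settings, a CharacterString keyword stays a dcat:keyword even when it belongs to a thesaurus, which is the case the reviewer points out.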

<entry key="dct:description">mdb:MD_Metadata/mdb:distributionInfo/mrd:MD_Distribution/mrd:transferOptions/mrd:MD_DigitalTransferOptions/mrd:onLine/cit:CI_OnlineResource/cit:description</entry>
<entry key="dct:description">mdb:MD_Metadata/mdb:distributionInfo/mrd:MD_Distribution/mrd:distributor/mrd:MD_Distributor/mrd:distributorTransferOptions/mrd:MD_DigitalTransferOptions/mrd:onLine/cit:CI_OnlineResource/cit:description</entry>
<entry key="owl:versionInfo">mdb:MD_Metadata/mdb:metadataStandard/cit:CI_Citation/cit:edition</entry>
<entry key="adms:versionNotes">mdb:MD_Metadata/mdb:resourceLineage/mrl:LI_Lineage/mrl:statement</entry>
Member

This does not seem to be used.

Comment on lines +295 to +298
Rule:
* Use mimetype if any
* Use WWW:DOWNLOAD:(.*=format) if any
* fallback to ancestor::mrd:MD_DigitalTransferOptions/mrd:distributionFormat/*/mrd:formatSpecificationCitation
Member

This is a bit unclear: apparently dcat:mediaType, dcat:compressFormat and dcat:packageFormat all rely on mrd:MD_DigitalTransferOptions/mrd:distributionFormat/*/mrd:formatSpecificationCitation?

But in https://github.com/geonetwork/core-geonetwork/pull/7600/files#diff-5fa6856eb0e0615450b025fa6ac293be136ee98f4b82b908997ebf6667438ef8R399-R404 the elementName is not set for any of them.

For dcat:compressFormat there is another template, and it is also unclear whether it is used: https://github.com/geonetwork/core-geonetwork/pull/7600/files#diff-5fa6856eb0e0615450b025fa6ac293be136ee98f4b82b908997ebf6667438ef8R407-R413

Comment on lines +407 to +413
<xsl:template mode="iso19115-3-to-dcat-distribution"
              match="mrd:distributionFormat/*/mrd:fileDecompressionTechnique">
  <xsl:call-template name="rdf-format-as-mediatype">
    <xsl:with-param name="elementName" select="'dcat:compressFormat'"/>
    <xsl:with-param name="format" select="*/text()"/>
  </xsl:call-template>
</xsl:template>
Member

Unclear, as it doesn't seem to work. Also, the same value apparently gets assigned to all the online resources?

Member Author

Yes, this can be challenging, as a DCAT distribution contains a number of properties that ISO assigns to the resource. It works fine if a record has only one download link, or if each link is encoded in its own distribution block, but that is usually not the case.

There is a test for example
https://github.com/SPW-DIG/metawal-core-geonetwork/blob/dcat/services/src/test/resources/org/fao/geonet/api/records/formatters/iso19115-3.2018-dcat-dataset.xml#L1254-L1294
is converted to
https://github.com/SPW-DIG/metawal-core-geonetwork/blob/dcat/services/src/test/resources/org/fao/geonet/api/records/formatters/iso19115-3.2018-eu-dcat-ap-dataset-core.rdf#L554-L570

Dedicated template exist in dcat-core-keyword
* Add mapping for referenceSystem
* Add test
* Disable distribution for now and delegate to DCAT-AP for now.
Cardinality:
  * ISO 0..n
  * DCAT 0..n
  * DCAT-AP 0..1
  * Mobility DCAT 1..1 (in ISO either use corresponding period eg. P0Y0M0DT1H0M0S or extend the codelist with the proper vocabulary)

  accrualPeriodicity mapping is done using the ISO to Dublin Core value mapping,
  but additional checks are done when ISO records extend the codelist and
  may use the EU Publications Office frequency codes
  or the Mobility DCAT-AP update frequency codes.
  Domain-specific codelists take priority over the DC or ISO codelists.

  eg.
  <mmi:MD_MaintenanceFrequencyCode codeListValue="15min"/>

  multipleAccrualPeriodicityAllowed is a parameter that can be set to true to allow multiple accrualPeriodicity values.
  Defaults to false for EU formatters and to true for DCAT.
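
The selection logic described in this note can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `ISO_TO_DC`, `DOMAIN_CODES` and `accrual_periodicity` are hypothetical, the tables contain just the two values that appear in this conversation (annually mapped to ANNUAL, and the 15min codelist extension), and "priority" is interpreted here as domain-specific codes being listed first.

```python
# Only pair confirmed in the review thread; the real mapping is larger.
ISO_TO_DC = {"annually": "ANNUAL"}

# Example of a codelist extension value taken from the note above.
DOMAIN_CODES = {"15min"}


def accrual_periodicity(code_list_values, multiple_allowed=False):
    """Return the accrualPeriodicity values to emit for a record."""
    # Domain-specific codes come first so they win when only one value is kept.
    domain = [v for v in code_list_values if v in DOMAIN_CODES]
    generic = [ISO_TO_DC[v] for v in code_list_values if v in ISO_TO_DC]
    result = domain + generic
    # EU formatters keep a single value (0..1); plain DCAT may keep all of them.
    return result if multiple_allowed else result[:1]
```

Calling it with `multiple_allowed=True` mirrors the DCAT default, while the default argument mirrors the EU formatters.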
@jahow
Contributor

jahow commented Sep 23, 2024

Hi @fxprunayre,

Could you please clarify something for me: why do we need different export profiles (HVD, Mobility etc.)? Why can't we just output one graph that would provide the proper elements to fill the requirements for all of these profiles?

I feel like RDF allows creating graphs that are quite rich and contain many different statements targeting various consumers. Is there a reason, technical or other, for this profile-based approach? Thanks

@fxprunayre
Member Author

why do we need different export profiles (HVD, Mobility etc.)? Why can't we just output one graph that would provide the proper elements to fill the requirements for all of these profiles?

Some open data portals expect a particular DCAT profile but, true, we could mix everything in one graph. However, if you look at the profiles' models and SHACL validation, one element can be encoded in different ways, can be optional in one profile and mandatory in others, or may not have the same definition (eg. vocabularies). Across versions of a profile, the encoding of an element can also change over time.
So it is important to test with other applications ingesting DCAT and see what they expect. We are currently using this PR with the Belgian mobility and HVD portals.

https://semiceu.github.io/DCAT-AP/r5r/releases/3.0.0/#applicableLegislation

Add the element to DCAT-AP base. Element is 0..n in DCAT-AP and should be present in extensions
(mobility, hvd, geodcat).

HVD requires at least http://data.europa.eu/eli/reg_impl/2023/138/oj and
cardinality is 1..n.

Do not restrict to a particular legislation list. A sample vocabulary is
provided but it can be extended depending on catalogue domains.
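
As a small illustration of the HVD constraint stated above, the check below reports a problem when the required regulation is missing. It is a Python sketch with hypothetical names (`check_applicable_legislation`, the profile identifier string), not the SHACL or schematron rule itself.

```python
HVD_REGULATION = "http://data.europa.eu/eli/reg_impl/2023/138/oj"


def check_applicable_legislation(profile, legislation):
    """Return a list of problems found for the given profile (empty when fine)."""
    problems = []
    # Only HVD makes the element mandatory (1..n); it is 0..n in DCAT-AP.
    if profile == "eu-dcat-ap-hvd" and HVD_REGULATION not in legislation:
        problems.append(f"HVD requires applicableLegislation {HVD_REGULATION}")
    return problems
```

Other legislation URIs are accepted as-is, in line with the note that the list is not restricted.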
In DCAT and DCAT-AP, `dct:identifier` is 0..n.
Mobility DCAT restricts it to 0..1.

In DCAT-AP and extensions, only convert the first identifier as
`dct:identifier`; others as `adms:identifier`.
Use `:` separator for URN like identifiers.
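
The identifier rule above could be sketched like this in Python. The function name `split_identifiers` and the profile strings are hypothetical, and the sketch only splits plain strings: in the RDF output each `adms:identifier` would be a structured node rather than a literal.

```python
def split_identifiers(identifiers, profile="eu-dcat-ap"):
    """Distribute resource identifiers over dct:identifier and adms:identifier."""
    identifiers = list(identifiers)
    if profile == "dcat":
        # Plain DCAT allows 0..n dct:identifier, so keep them all.
        return {"dct:identifier": identifiers, "adms:identifier": []}
    # DCAT-AP and its extensions: first one as dct:identifier, the rest as adms:identifier.
    return {"dct:identifier": identifiers[:1], "adms:identifier": identifiers[1:]}
```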
Projects
Status: In review
5 participants