diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
index 35d91633..2c92e0fc 100644
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -54,11 +54,13 @@ jobs:
         pip install -e .
         # Replace default path to CKAN core config file with the one on the container
         sed -i -e 's/use = config:.*/use = config:\/srv\/app\/src\/ckan\/test-core.ini/' test.ini
-    - name: Setup harvest extension
+    - name: Setup other extensions
       run: |
         git clone https://github.com/ckan/ckanext-harvest
         pip install -e ckanext-harvest
-        pip install -r ckanext-harvest/pip-requirements.txt
+        pip install -r ckanext-harvest/requirements.txt
+        git clone https://github.com/ckan/ckanext-scheming
+        pip install -e ckanext-scheming
     - name: Setup extension
       run: |
         ckan -c test.ini db init
diff --git a/CHANGELOG.md b/CHANGELOG.md
index c48332a9..c1aa08c7 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,27 @@
 
 ## [Unreleased](https://github.com/ckan/ckanext-dcat/compare/v1.7.0...HEAD)
 
+* Support for standard CKAN [ckanext-scheming](https://github.com/ckan/ckanext-scheming) schemas.
+  The DCAT profiles now seamlessly integrate with fields defined via the YAML or JSON scheming files.
+  Sites willing to migrate to a scheming-based metadata schema can do
+  so by adding the `euro_dcat_ap_scheming` profile at the end of their profile chain (e.g.
+  `ckanext.dcat.rdf.profiles = euro_dcat_ap_2 euro_dcat_ap_scheming`), which will modify the existing profile
+  outputs to the format expected by the scheming validators. Sample schemas are provided
+  in the `ckanext/dcat/schemas` folder. See the [documentation](https://github.com/ckan/ckanext-dcat?tab=readme-ov-file#schemas)
+  for all details. Some highlights of the new scheming-based profiles:
+
+    * Actual list support in the API output for list properties like `dct:language`
+    * Multiple objects now allowed for properties like `dcat:ContactPoint`, `dct:spatial` or `dct:temporal`
+    * Custom validators for date values that allow `xsd:gYear`, `xsd:gYearMonth`, `xsd:date` and `xsd:dateTime`
+
+  (#281)
+* New `ckan dcat consume` and `ckan dcat produce` CLI commands (#279)
+* Parse dcat:spatialResolutionInMeters as float (#285)
+* Split profile classes into their own separate files (#282)
+* Catch Not Authorized in View (#280)
+* CKAN 2.11 support and requirements updates (#270)
+
+
 ## [v1.7.0](https://github.com/ckan/ckanext-dcat/compare/v1.6.0...v1.7.0) - 2024-04-04
 
 * Adds support for the latest Hydra vocabulary. For backward compatibility, the old properties are still supported but marked as deprecated. (#267)
diff --git a/README.md b/README.md
index c79fc710..f050efdf 100644
--- a/README.md
+++ b/README.md
@@ -5,51 +5,66 @@
 [![Code Coverage](http://codecov.io/github/ckan/ckanext-dcat/coverage.svg?branch=master)](http://codecov.io/github/ckan/ckanext-dcat?branch=master)
 
-This extension provides plugins that allow CKAN to expose and consume metadata from other catalogs using RDF documents serialized using DCAT. The Data Catalog Vocabulary (DCAT) is "an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web". More information can be found on the following W3C page:
+This extension provides plugins that allow CKAN to expose its metadata and consume metadata from other catalogs using RDF documents serialized using DCAT. The Data Catalog Vocabulary (DCAT) is "an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web".
More information can be found on the following W3C page: [http://www.w3.org/TR/vocab-dcat](http://www.w3.org/TR/vocab-dcat) It also offers other features related to Semantic Data like exposing the necessary markup to get your datasets indexed in [Google Dataset Search](https://toolbox.google.com/datasetsearch). +Check the [overview](#overview) section for a summary of the available features. + ## Contents + + - [Overview](#overview) - [Installation](#installation) +- [Schemas](#schemas) + * [Compatibility with existing profiles](#compatibility-with-existing-profiles) - [RDF DCAT endpoints](#rdf-dcat-endpoints) - - [Dataset endpoints](#dataset-endpoints) - - [Catalog endpoint](#catalog-endpoint) - - [URIs](#uris) - - [Content negotiation](#content-negotiation) + * [Dataset endpoints](#dataset-endpoints) + * [Catalog endpoint](#catalog-endpoint) + * [URIs](#uris) + * [Content negotiation](#content-negotiation) - [RDF DCAT harvester](#rdf-dcat-harvester) - - [Maximum file size](#maximum-file-size) - - [Transitive harvesting](#transitive-harvesting) - - [Extending the RDF harvester](#extending-the-rdf-harvester) + * [Maximum file size](#maximum-file-size) + * [Transitive harvesting](#transitive-harvesting) + * [Extending the RDF harvester](#extending-the-rdf-harvester) - [JSON DCAT harvester](#json-dcat-harvester) - [RDF DCAT to CKAN dataset mapping](#rdf-dcat-to-ckan-dataset-mapping) + * [Custom fields](#custom-fields) + * [URIs](#uris-1) + * [Lists](#lists) + * [Contact points and Publisher](#contact-points-and-publisher) + * [Spatial coverage](#spatial-coverage) + * [Licenses](#licenses) - [RDF DCAT Parser](#rdf-dcat-parser) - [RDF DCAT Serializer](#rdf-dcat-serializer) + * [Inherit license from the dataset as fallback in distributions](#inherit-license-from-the-dataset-as-fallback-in-distributions) - [Profiles](#profiles) - - [Writing custom profiles](#writing-custom-profiles) - - [Command line interface](#command-line-interface) - - [Compatibility mode](#compatibility-mode) + * [Writing custom profiles](#writing-custom-profiles) + * [Command line interface](#command-line-interface) + * [Compatibility mode](#compatibility-mode) - [XML DCAT harvester (deprecated)](#xml-dcat-harvester-deprecated) - [Translation of fields](#translation-of-fields) -- [Structured Data and Google Dataset Search indexing](#structured-data-and-google-dataset-search-indexing) +- [Structured data and Google Dataset Search indexing](#structured-data-and-google-dataset-search-indexing) - [CLI](#cli) - [Running the Tests](#running-the-tests) - [Releases](#releases) - [Acknowledgements](#acknowledgements) - [Copying and License](#copying-and-license) -## Overview + -With the emergence of Open Data initiatives around the world, the need to share metadata across different catalogs has became more evident. Sites like [data.europa.eu](https://data.europa.eu/en) aggregate datasets from different portals, and there has been a growing demand to provide a clear and standard interface to allow incorporating metadata into them automatically. +## Overview -There is growing consensus around [DCAT](http://www.w3.org/TR/vocab-dcat) being the right way forward, but actual implementations are needed. This extension aims to provide tools and guidance to allow publishers to publish and share DCAT based metadata easily. +[DCAT](http://www.w3.org/TR/vocab-dcat) has become the basis for many metadata sharing standards, like DCAT-AP and DCAT-US for data portals in Europe and the USA respectively. 
This extension aims to provide tools and guidance to allow publishers to publish and share DCAT-based metadata easily.
 
 In terms of CKAN features, this extension offers:
 
+* [Pre-built CKAN schemas](#schemas) for common Application Profiles that can be adapted to each site's requirements to provide out-of-the-box DCAT support in data portals.
+
 * [RDF DCAT Endpoints](#rdf-dcat-endpoints) that expose the catalog's datasets in different RDF serializations (`dcat` plugin).
 
 * An [RDF Harvester](#rdf-dcat-harvester) that allows importing RDF serializations from other catalogs to create CKAN datasets (`dcat_rdf_harvester` plugin).
 
@@ -69,20 +84,66 @@ These are implemented internally using:
 
 ## Installation
 
-1. Install ckanext-harvest ([https://github.com/ckan/ckanext-harvest#installation](https://github.com/ckan/ckanext-harvest#installation)) (Only if you want to use the RDF harvester)
-2. Install the extension on your virtualenv:
+1. Install the extension in your virtualenv:
 
        (pyenv) $ pip install -e git+https://github.com/ckan/ckanext-dcat.git#egg=ckanext-dcat
 
-3. Install the extension requirements:
+2. Install the extension requirements:
 
        (pyenv) $ pip install -r ckanext-dcat/requirements.txt
 
-4. Enable the required plugins in your ini file:
+3. Enable the required plugins in your ini file:
 
        ckan.plugins = dcat dcat_rdf_harvester dcat_json_harvester dcat_json_interface structured_data
 
+4. To use the pre-built schemas, install [ckanext-scheming](https://github.com/ckan/ckanext-scheming):
+
+       pip install -e "git+https://github.com/ckan/ckanext-scheming.git#egg=ckanext-scheming"
+
+Check the [Schemas](#schemas) section for extra configuration needed.
+
+Optionally, if you want to use the RDF harvester, install ckanext-harvest as well ([https://github.com/ckan/ckanext-harvest#installation](https://github.com/ckan/ckanext-harvest#installation)).
+
+## Schemas
+
+The extension includes ready-to-use [ckanext-scheming](https://github.com/ckan/ckanext-scheming) schemas that enable DCAT support. These include a schema definition file (located in `ckanext/dcat/schemas`) plus extra validators and other custom logic that integrates the metadata modifications with the RDF DCAT [Parsers](#rdf-dcat-parser) and [Serializers](#rdf-dcat-serializer) and other CKAN features and extensions.
+
+The following schemas are currently included with the extension:
+
+* *dcat_ap_2.1_recommended.yaml*: Includes the recommended properties for `dcat:Dataset` and `dcat:Distribution` according to the [DCAT-AP 2.1](https://semiceu.github.io/DCAT-AP/releases/2.1.1/) specification.
+* *dcat_ap_2.1_full.yaml*: Includes most of the properties defined for `dcat:Dataset` and `dcat:Distribution` in the [DCAT-AP 2.1](https://semiceu.github.io/DCAT-AP/releases/2.1.1/) specification.
+
+Most sites will want to use these as a base to create their own custom schema that addresses their particular requirements, perhaps alongside a [custom profile](#writing-custom-profiles). Of course site maintainers can add or remove schema fields, as well as change the existing validators.
+
+In any case, the schema file used should be defined in the configuration file, alongside these configuration options:
+
+    # Make sure to add scheming_datasets after the dcat plugin
+    ckan.plugins = activity dcat [...] scheming_datasets
+
+    # Point to one of the defaults or your own version of the schema file
+    scheming.dataset_schemas = ckanext.dcat.schemas:dcat_ap_2.1_recommended.yaml
+
+    # Include the dcat presets as well as the standard scheming ones
+    scheming.presets = ckanext.scheming:presets.json ckanext.dcat.schemas:presets.yaml
+
+    # Sites using the euro_dcat_ap and euro_dcat_ap_2 profiles must add the
+    # euro_dcat_ap_scheming profile if they want to use ckanext-scheming schemas (see next section)
+    ckanext.dcat.rdf.profiles = euro_dcat_ap_2 euro_dcat_ap_scheming
+
+### Compatibility with existing profiles
+
+Sites using the existing `euro_dcat_ap` and `euro_dcat_ap_2` profiles should not see any change in their
+current parsing and serialization functionality, and these profiles will not change their outputs going
+forward (except for bug fixes). Sites willing to migrate to a scheming-based metadata schema can do
+so by adding the `euro_dcat_ap_scheming` profile at the end of their profile chain (e.g.
+`ckanext.dcat.rdf.profiles = euro_dcat_ap_2 euro_dcat_ap_scheming`), which will modify the existing profile
+outputs to the format expected by the scheming validators.
+
+Note that the scheming profile will only affect fields defined in the schema definition file, so sites can gradually migrate individual metadata fields.
+
+
 
 ## RDF DCAT endpoints
 
 By default when the `dcat` plugin is enabled, the following RDF endpoints are available on your CKAN instance. The schema used on the serializations can be customized using [profiles](#profiles).
@@ -308,69 +369,71 @@ To enable the JSON harvester, add the `dcat_json_harvester` plugin to your CKAN
 
 ## RDF DCAT to CKAN dataset mapping
 
 The following table provides a generic mapping between the fields of the `dcat:Dataset` and `dcat:Distribution` classes and
-their equivalents on the CKAN model. In most cases this mapping is deliberately a loose one. For instance, it does not try to link
+their equivalents in the CKAN model. In most cases this mapping is deliberately a loose one. For instance, it does not try to link
 the DCAT publisher property with a CKAN dataset author, maintainer or organization, as the link between them is not straight-forward
 and may depend on a particular instance needs. When mapping from CKAN metadata to DCAT though, there are in some cases fallback fields
 that are used if the default field is not present (see [RDF Serializer](#rdf-dcat-serializer) for more details on this.
 
 This mapping is compatible with the [DCAT-AP v1.1](https://joinup.ec.europa.eu/asset/dcat_application_profile/asset_release/dcat-ap-v11) and [DCAT-AP v2.1](https://joinup.ec.europa.eu/collection/semantic-interoperability-community-semic/solution/dcat-application-profile-data-portals-europe/release/210). It depends on the active profile(s) (see [Profiles](#profiles)) which DCAT properties are mapped.
 
+Sites are encouraged to use ckanext-scheming to manage their metadata schema (see [Schemas](#schemas) for all details). In some cases this changes
+the way metadata is stored internally and presented at the CKAN API level, but it should not affect the RDF DCAT output.
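+
+As a quick illustration of the profile chain in code, the parser can also be instantiated programmatically with the existing `RDFParser` class (see [RDF DCAT Parser](#rdf-dcat-parser)). This is a minimal sketch; `ttl_catalog` stands for an illustrative string holding DCAT metadata serialized as Turtle:
+
+```python
+from ckanext.dcat.processors import RDFParser
+
+# Profiles are applied in order: the scheming profile transforms the
+# dicts produced by euro_dcat_ap_2 into the scheming-compatible format
+parser = RDFParser(profiles=["euro_dcat_ap_2", "euro_dcat_ap_scheming"])
+
+parser.parse(ttl_catalog, _format="turtle")
+
+for dataset_dict in parser.datasets():
+    print(dataset_dict["title"])
+```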
| DCAT class | DCAT property | CKAN dataset field | CKAN fallback fields | Stored as | | |-------------------|------------------------|-------------------------------------------|--------------------------------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------| -| dcat:Dataset | - | extra:uri | | text | See note about URIs | +| dcat:Dataset | - | extra:uri | | text | See [URIs](#uris-1) | | dcat:Dataset | dct:title | title | | text | | | dcat:Dataset | dct:description | notes | | text | | | dcat:Dataset | dcat:keyword | tags | | text | | -| dcat:Dataset | dcat:theme | extra:theme | | list | See note about lists | +| dcat:Dataset | dcat:theme | extra:theme | | list | See [Lists](#lists) | | dcat:Dataset | dct:identifier | extra:identifier | extra:guid, id | text | | | dcat:Dataset | adms:identifier | extra:alternate_identifier | | text | | | dcat:Dataset | dct:issued | extra:issued | metadata_created | text | | | dcat:Dataset | dct:modified | extra:modified | metadata_modified | text | | | dcat:Dataset | owl:versionInfo | version | extra:dcat_version | text | | | dcat:Dataset | adms:versionNotes | extra:version_notes | | text | | -| dcat:Dataset | dct:language | extra:language | | list | See note about lists | +| dcat:Dataset | dct:language | extra:language | | list | See [Lists](#lists) | | dcat:Dataset | dcat:landingPage | url | | text | | | dcat:Dataset | dct:accrualPeriodicity | extra:frequency | | text | | -| dcat:Dataset | dct:conformsTo | extra:conforms_to | | list | See note about lists | +| dcat:Dataset | dct:conformsTo | extra:conforms_to | | list | See [Lists](#lists) | | dcat:Dataset | dct:accessRights | extra:access_rights | | text | | -| dcat:Dataset | foaf:page | extra:documentation | | list | See note about lists | +| dcat:Dataset | foaf:page | extra:documentation | | list | See [Lists](#lists) | | dcat:Dataset | dct:provenance | extra:provenance | | text | | | dcat:Dataset | dct:type | extra:dcat_type | | text | As of DCAT-AP v1.1 there's no controlled vocabulary for this field | -| dcat:Dataset | dct:hasVersion | extra:has_version | | list | See note about lists. It is assumed that these are one or more URIs referring to another dcat:Dataset | -| dcat:Dataset | dct:isVersionOf | extra:is_version_of | | list | See note about lists. It is assumed that these are one or more URIs referring to another dcat:Dataset | -| dcat:Dataset | dct:source | extra:source | | list | See note about lists. It is assumed that these are one or more URIs referring to another dcat:Dataset | -| dcat:Dataset | adms:sample | extra:sample | | list | See note about lists. It is assumed that these are one or more URIs referring to dcat:Distribution instances | -| dcat:Dataset | dct:spatial | extra:spatial_uri | | text | If the RDF provides them, profiles should store the textual and geometric representation of the location in extra:spatial_text, extra:spatial, extra:spatial_bbox and extra:spatial_centroid respectively | +| dcat:Dataset | dct:hasVersion | extra:has_version | | list | See [Lists](#lists). It is assumed that these are one or more URIs referring to another dcat:Dataset | +| dcat:Dataset | dct:isVersionOf | extra:is_version_of | | list | See [Lists](#lists). It is assumed that these are one or more URIs referring to another dcat:Dataset | +| dcat:Dataset | dct:source | extra:source | | list | See [Lists](#lists). 
It is assumed that these are one or more URIs referring to another dcat:Dataset | +| dcat:Dataset | adms:sample | extra:sample | | list | See [Lists](#lists). It is assumed that these are one or more URIs referring to dcat:Distribution instances | +| dcat:Dataset | dct:spatial | extra:spatial_uri | | text | See [Spatial coverage](#spatial-coverage) | | dcat:Dataset | dct:temporal | extra:temporal_start + extra:temporal_end | | text | None, one or both extras can be present | | dcat:Dataset | dcat:temporalResolution| extra:temporal_resolution | | list | | | dcat:Dataset | dcat:spatialResolutionInMeters| extra:spatial_resolution_in_meters | | list | | | dcat:Dataset | dct:isReferencedBy | extra:is_referenced_by | | list | | -| dcat:Dataset | dct:publisher | extra:publisher_uri | | text | See note about URIs | +| dcat:Dataset | dct:publisher | extra:publisher_uri | | text | See [URIs](#uris-1) and [Publisher](#contact-points-and-publisher) | | foaf:Agent | foaf:name | extra:publisher_name | | text | | | foaf:Agent | foaf:mbox | extra:publisher_email | organization:title | text | | | foaf:Agent | foaf:homepage | extra:publisher_url | | text | | | foaf:Agent | dct:type | extra:publisher_type | | text | | -| dcat:Dataset | dcat:contactPoint | extra:contact_uri | | text | See note about URIs | +| dcat:Dataset | dcat:contactPoint | extra:contact_uri | | text | See [URIs](#uris-1) and [Contact points](#contact-points-and-publisher) | | vcard:Kind | vcard:fn | extra:contact_name | maintainer, author | text | | | vcard:Kind | vcard:hasEmail | extra:contact_email | maintainer_email, author_email | text | | | dcat:Dataset | dcat:distribution | resources | | text | | -| dcat:Distribution | - | resource:uri | | text | See note about URIs | +| dcat:Distribution | - | resource:uri | | text | See [URIs](#uris-1) | | dcat:Distribution | dct:title | resource:name | | text | | | dcat:Distribution | dcat:accessURL | resource:access_url | resource:url | text | If downloadURL is not present, accessURL will be used as resource url | | dcat:Distribution | dcat:downloadURL | resource:download_url | | text | If present, downloadURL will be used as resource url | | dcat:Distribution | dct:description | resource:description | | text | | | dcat:Distribution | dcat:mediaType | resource:mimetype | | text | | -| dcat:Distribution | dct:format | resource:format | | text | This is likely to require extra logic to accommodate how CKAN deals with formats (eg ckan/ckanext-dcat#18) | -| dcat:Distribution | dct:license | resource:license | | text | See note about dataset license | +| dcat:Distribution | dct:format | resource:format | | text | | +| dcat:Distribution | dct:license | resource:license | | text | See [Licenses](#licenses) | | dcat:Distribution | adms:status | resource:status | | text | | | dcat:Distribution | dcat:byteSize | resource:size | | number | | | dcat:Distribution | dct:issued | resource:issued | created | text | | | dcat:Distribution | dct:modified | resource:modified | metadata_modified | text | | | dcat:Distribution | dct:rights | resource:rights | | text | | -| dcat:Distribution | foaf:page | resource:documentation | | list | See note about lists | -| dcat:Distribution | dct:language | resource:language | | list | See note about lists | -| dcat:Distribution | dct:conformsTo | resource:conforms_to | | list | See note about lists | +| dcat:Distribution | foaf:page | resource:documentation | | list | See [Lists](#lists) | +| dcat:Distribution | dct:language | resource:language | | list | See [Lists](#lists) 
|
+| dcat:Distribution | dct:conformsTo         | resource:conforms_to                      |                                | list      | See [Lists](#lists)                                                                             |
 | dcat:Distribution | dcatap:availability    | resource:availability                     |                                | text      |                                                                                                 |
 | dcat:Distribution | dcat:compressFormat    | resource:compress_format                  |                                | text      |                                                                                                 |
 | dcat:Distribution | dcat:packageFormat     | resource:package_format                   |                                | text      |                                                                                                 |
@@ -388,8 +451,33 @@ This mapping is compatible with the [DCAT-AP v1.1](https://joinup.ec.europa.eu/a
 
 *Notes*
 
-* Whenever possible, URIs are extracted and stored so there is a clear reference to the original RDF resource.
-  For instance:
+### Custom fields
+
+Fields marked as `extra:` are stored as free-form extras in the `euro_dcat_ap` and `euro_dcat_ap_2` profiles,
+but stored as first-level custom fields when using the scheming-based profile (`euro_dcat_ap_scheming`), i.e.:
+
+    ```json
+    {
+        "name": "test_dataset_dcat",
+        "extras": [
+            {"key": "version_notes", "value": "Some version notes"}
+        ]
+    }
+    ```
+
+    vs:
+
+    ```json
+    {
+        "name": "test_dataset_dcat",
+        "version_notes": "Some version notes"
+    }
+    ```
+
+### URIs
+
+Whenever possible, URIs are extracted and stored so there is a clear reference to the original RDF resource.
+For instance:
 
     ```xml
 
@@ -456,7 +544,9 @@ This mapping is compatible with the [DCAT-AP v1.1](https://joinup.ec.europa.eu/a
     }
    ```
 
-* Lists are stored as a JSON string, eg:
+### Lists
+
+On the legacy profiles, lists are stored as a JSON string, eg:
 
     ```
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct: <http://purl.org/dc/terms/> .
 
@@ -481,7 +571,58 @@ This mapping is compatible with the [DCAT-AP v1.1](https://joinup.ec.europa.eu/a
     }
     ```
 
-* The following formats for `dct:spatial` are supported by the default [parser](#rdf-dcat-parser). Note that the default [serializer](#rdf-dcat-serializer) will return the single `dct:spatial` instance form by default.
+On the scheming-based ones, these are shown as actual lists:
+
+    ```json
+    {
+        "title": "Dataset 1",
+        "uri": "http://data.some.org/catalog/datasets/1",
+        "language": ["ca", "en", "es"],
+        "theme": ["Earth Sciences", "http://eurovoc.europa.eu/209065", "http://eurovoc.europa.eu/100142"]
+    }
+    ```
+
+### Contact points and Publisher
+
+Properties for `dcat:contactPoint` and `dct:publisher` are stored as namespaced extras in the legacy profiles. When using
+a scheming-based profile, these are stored as proper objects (and multiple instances are allowed for contact points):
+
+```json
+{
+    "name": "test_dataset_dcat",
+    "title": "Test dataset DCAT",
+    "extras": [
+        {"key":"contact_name","value":"Point of Contact"},
+        {"key":"contact_email","value":"contact@some.org"}
+    ]
+}
+```
+
+vs:
+
+```json
+{
+    "name": "test_dataset_dcat",
+    "title": "Test dataset DCAT",
+    "contact": [
+        {
+            "name": "Point of Contact 1",
+            "email": "contact1@some.org"
+        },
+        {
+            "name": "Point of Contact 2",
+            "email": "contact2@some.org"
+        }
+    ]
+}
+```
+
+If no `publisher` or `publisher_*` fields are found, the serializers will fall back to getting the publisher properties from the organization the CKAN dataset belongs to. The organization schema can be customized with the schema located in `ckanext/dcat/schemas/publisher_organization.yaml` to provide the extra properties supported (this will additionally require loading the `scheming_organizations` plugin in `ckan.plugins`).
+
+### Spatial coverage
+
+The following formats for `dct:spatial` are supported by the default [parser](#rdf-dcat-parser). Note that the default [serializer](#rdf-dcat-serializer) will return the single `dct:spatial` instance form by default.
- One `dct:spatial` instance, URI only
 
@@ -531,8 +672,45 @@ This mapping is compatible with the [DCAT-AP v1.1](https://joinup.ec.europa.eu/a
     ```
 
+If the RDF provides them, profiles should store the textual and geometric representation of the location in:
+
+* For legacy profiles, in the `spatial_text`, `spatial_bbox`, `spatial_centroid` or `spatial` (for any other geometries) extra fields
+* For scheming-based profiles, in objects in the `spatial_coverage` field, for instance:
+
+```json
+{
+    "name": "test_dataset_dcat",
+    "title": "Test dataset DCAT",
+    "spatial_coverage": [
+        {
+            "geom": {
+                "type": "Polygon",
+                "coordinates": [...]
+            },
+            "text": "Tarragona",
+            "uri": "https://sws.geonames.org/6361390/",
+            "bbox": {
+                "type": "Polygon",
+                "coordinates": [
+                    [
+                        [-2.1604, 42.7611],
+                        [-2.0938, 42.7611],
+                        [-2.0938, 42.7931],
+                        [-2.1604, 42.7931],
+                        [-2.1604, 42.7611]
+                    ]
+                ]
+            },
+            "centroid": {"type": "Point", "coordinates": [1.26639, 41.12386]}
+        }
+    ]
+}
+```
+
+### Licenses
 
-* On the CKAN model, license is at the dataset level whereas in DCAT model it
+On the CKAN model, license is at the dataset level whereas in DCAT model it
   is at distributions level. By default the RDF parser will try to find a
   distribution with a license that matches one of those registered in CKAN
   and attach this license to the dataset. The first matching distribution's
diff --git a/ckanext/dcat/plugins/__init__.py b/ckanext/dcat/plugins/__init__.py
index 617e8d4b..2aef170b 100644
--- a/ckanext/dcat/plugins/__init__.py
+++ b/ckanext/dcat/plugins/__init__.py
@@ -2,6 +2,7 @@
 from builtins import object
 
 import os
+import json
 
 from ckantoolkit import config
 
@@ -19,6 +20,7 @@
     dcat_auth,
 )
 from ckanext.dcat import utils
+from ckanext.dcat.validators import dcat_validators
 
 CUSTOM_ENDPOINT_CONFIG = 'ckanext.dcat.catalog_endpoint'
 
@@ -28,6 +30,19 @@
 I18N_DIR = os.path.join(HERE, u"../i18n")
 
 
+def _get_dataset_schema(dataset_type="dataset"):
+    schema = None
+    try:
+        schema_show = p.toolkit.get_action("scheming_dataset_schema_show")
+        try:
+            schema = schema_show({}, {"type": dataset_type})
+        except p.toolkit.ObjectNotFound:
+            pass
+    except KeyError:
+        pass
+    return schema
+
+
 class DCATPlugin(p.SingletonPlugin, DefaultTranslation):
 
     p.implements(p.IConfigurer, inherit=True)
@@ -38,6 +53,7 @@ class DCATPlugin(p.SingletonPlugin, DefaultTranslation):
     p.implements(p.ITranslation, inherit=True)
     p.implements(p.IClick)
     p.implements(p.IBlueprint)
+    p.implements(p.IValidators)
 
     # IClick
 
@@ -101,17 +117,31 @@ def get_auth_functions(self):
             'dcat_catalog_search': dcat_auth,
         }
 
+    # IValidators
+    def get_validators(self):
+        return dcat_validators
+
     # IPackageController
 
     # CKAN < 2.10 hooks
     def after_show(self, context, data_dict):
         return self.after_dataset_show(context, data_dict)
 
+    def before_index(self, dataset_dict):
+        return self.before_dataset_index(dataset_dict)
+
     # CKAN >= 2.10 hooks
     def after_dataset_show(self, context, data_dict):
+        schema = _get_dataset_schema(data_dict["type"])
         # check if config is enabled to translate keys (default: True)
-        if not p.toolkit.asbool(config.get(TRANSLATE_KEYS_CONFIG, True)):
+        # skip if scheming is enabled, as this will be handled there
+        translate_keys = (
+            p.toolkit.asbool(config.get(TRANSLATE_KEYS_CONFIG, True))
+            and not schema
+        )
+
+        if not translate_keys:
             return data_dict
 
         if context.get('for_view'):
@@ -132,6 +162,52 @@ def set_titles(object_dict):
 
         return data_dict
 
+    def before_dataset_index(self, dataset_dict):
+        schema = _get_dataset_schema(dataset_dict["type"])
+        spatial = None
+        if
schema: + for field in schema['dataset_fields']: + if field['field_name'] in dataset_dict and 'repeating_subfields' in field: + for item in dataset_dict[field['field_name']]: + for key in item: + value = item[key] + if not isinstance(value, dict): + # Index a flattened version + new_key = f'extras_{field["field_name"]}__{key}' + if not dataset_dict.get(new_key): + dataset_dict[new_key] = value + else: + dataset_dict[new_key] += ' ' + value + + subfields = dataset_dict.pop(field['field_name'], None) + if field['field_name'] == 'spatial_coverage': + spatial = subfields + + # Store the first geometry found so ckanext-spatial can pick it up for indexing + def _check_for_a_geom(spatial_dict): + value = None + + for field in ('geom', 'bbox', 'centroid'): + if spatial_dict.get(field): + value = spatial_dict[field] + if isinstance(value, dict): + try: + value = json.dumps(value) + break + except ValueError: + pass + return value + + if spatial and not dataset_dict.get('spatial'): + for item in spatial: + value = _check_for_a_geom(item) + if value: + dataset_dict['spatial'] = value + dataset_dict['extras_spatial'] = value + break + + return dataset_dict + class DCATJSONInterface(p.SingletonPlugin): p.implements(p.IActions) diff --git a/ckanext/dcat/processors.py b/ckanext/dcat/processors.py index e6093443..92b15c4a 100644 --- a/ckanext/dcat/processors.py +++ b/ckanext/dcat/processors.py @@ -33,12 +33,15 @@ class RDFProcessor(object): - def __init__(self, profiles=None, compatibility_mode=False): + def __init__(self, profiles=None, dataset_type='dataset', compatibility_mode=False): ''' Creates a parser or serializer instance You can optionally pass a list of profiles to be used. + A scheming dataset type can be provided, in which case the scheming schema + will be loaded by the base profile so it can be used by other profiles. 
+ In compatibility mode, some fields are modified to maintain compatibility with previous versions of the ckanext-dcat parsers (eg adding the `dcat_` prefix or storing comma separated lists instead @@ -56,6 +59,8 @@ def __init__(self, profiles=None, compatibility_mode=False): raise RDFProfileException( 'No suitable RDF profiles could be loaded') + self.dataset_type = dataset_type + if not compatibility_mode: compatibility_mode = p.toolkit.asbool( config.get(COMPAT_MODE_CONFIG_OPTION, False)) @@ -177,11 +182,16 @@ def datasets(self): for dataset_ref in self._datasets(): dataset_dict = {} for profile_class in self._profiles: - profile = profile_class(self.g, self.compatibility_mode) + profile = profile_class( + self.g, + dataset_type=self.dataset_type, + compatibility_mode=self.compatibility_mode + ) profile.parse_dataset(dataset_dict, dataset_ref) yield dataset_dict + class RDFSerializer(RDFProcessor): ''' A CKAN to RDF serializer based on rdflib @@ -245,7 +255,7 @@ def graph_from_dataset(self, dataset_dict): dataset_ref = URIRef(dataset_uri(dataset_dict)) for profile_class in self._profiles: - profile = profile_class(self.g, self.compatibility_mode) + profile = profile_class(self.g, compatibility_mode=self.compatibility_mode) profile.graph_from_dataset(dataset_dict, dataset_ref) return dataset_ref @@ -263,7 +273,7 @@ def graph_from_catalog(self, catalog_dict=None): catalog_ref = URIRef(catalog_uri()) for profile_class in self._profiles: - profile = profile_class(self.g, self.compatibility_mode) + profile = profile_class(self.g, compatibility_mode=self.compatibility_mode) profile.graph_from_catalog(catalog_dict, catalog_ref) return catalog_ref diff --git a/ckanext/dcat/profiles/__init__.py b/ckanext/dcat/profiles/__init__.py index 92266c72..a80a48c6 100644 --- a/ckanext/dcat/profiles/__init__.py +++ b/ckanext/dcat/profiles/__init__.py @@ -20,4 +20,5 @@ from .euro_dcat_ap import EuropeanDCATAPProfile from .euro_dcat_ap_2 import EuropeanDCATAP2Profile +from .euro_dcat_ap_scheming import EuropeanDCATAPSchemingProfile from .schemaorg import SchemaOrgProfile diff --git a/ckanext/dcat/profiles/base.py b/ckanext/dcat/profiles/base.py index 80711b3f..d1ff561b 100644 --- a/ckanext/dcat/profiles/base.py +++ b/ckanext/dcat/profiles/base.py @@ -7,10 +7,11 @@ from rdflib.namespace import Namespace, RDF, XSD, SKOS, RDFS from geomet import wkt, InvalidGeoJSONException -from ckantoolkit import config, url_for, asbool, get_action +from ckantoolkit import config, url_for, asbool, get_action, ObjectNotFound from ckan.model.license import LicenseRegister from ckan.lib.helpers import resource_formats from ckanext.dcat.utils import DCAT_EXPOSE_SUBCATALOGS +from ckanext.dcat.validators import is_year, is_year_month, is_date DCT = Namespace("http://purl.org/dc/terms/") DCAT = Namespace("http://www.w3.org/ns/dcat#") @@ -41,10 +42,23 @@ "spdx": SPDX, } -PREFIX_MAILTO = u"mailto:" +PREFIX_MAILTO = "mailto:" GEOJSON_IMT = "https://www.iana.org/assignments/media-types/application/vnd.geo+json" +ROOT_DATASET_FIELDS = [ + 'name', + 'title', + 'url', + 'version', + 'tags', + 'license_id', + 'maintainer', + 'maintainer_email', + 'author', + 'author_email', +] + class URIRefOrLiteral(object): """Helper which creates an URIRef if the value appears to be an http URL, @@ -105,11 +119,20 @@ class RDFProfile(object): custom profiles """ - def __init__(self, graph, compatibility_mode=False): - """Class constructor + _dataset_schema = None - Graph is an rdflib.Graph instance. 
+ # Cache for mappings of licenses URL/title to ID built when needed in + # _license(). + _licenceregister_cache = None + # Cache for organization_show details (used for publisher fallback) + _org_cache: dict = {} + + def __init__(self, graph, dataset_type="dataset", compatibility_mode=False): + """Class constructor + Graph is an rdflib.Graph instance. + A scheming dataset type can be provided, in which case the scheming schema + will be loaded so it can be used by profiles. In compatibility mode, some fields are modified to maintain compatibility with previous versions of the ckanext-dcat parsers (eg adding the `dcat_` prefix or storing comma separated lists instead @@ -120,9 +143,17 @@ def __init__(self, graph, compatibility_mode=False): self.compatibility_mode = compatibility_mode - # Cache for mappings of licenses URL/title to ID built when needed in - # _license(). - self._licenceregister_cache = None + try: + schema_show = get_action("scheming_dataset_schema_show") + try: + schema = schema_show({}, {"type": dataset_type}) + except ObjectNotFound: + raise ObjectNotFound(f"Unknown dataset schema: {dataset_type}") + + self._dataset_schema = schema + + except KeyError: + pass def _datasets(self): """ @@ -682,7 +713,7 @@ def _read_list_value(self, value): # List of values if isinstance(value, list): items = value - elif isinstance(value, str): + elif value and isinstance(value, str): try: items = json.loads(value) if isinstance(items, ((int, float, complex))): @@ -703,17 +734,19 @@ def _add_spatial_value_to_graph(self, spatial_ref, predicate, value): self.g.add((spatial_ref, predicate, Literal(value, datatype=GEOJSON_IMT))) # WKT, because GeoDCAT-AP says so try: + if isinstance(value, str): + value = json.loads(value) self.g.add( ( spatial_ref, predicate, Literal( - wkt.dumps(json.loads(value), decimals=4), + wkt.dumps(value, decimals=4), datatype=GSP.wktLiteral, ), ) ) - except (TypeError, ValueError, InvalidGeoJSONException): + except (TypeError, ValueError, InvalidGeoJSONException) as e: pass def _add_spatial_to_dict(self, dataset_dict, key, spatial): @@ -725,6 +758,64 @@ def _add_spatial_to_dict(self, dataset_dict, key, spatial): } ) + def _schema_field(self, key): + """ + Returns the schema field information if the provided key exists as a field in + the dataset schema (if one was provided) + """ + if not self._dataset_schema: + return None + + for field in self._dataset_schema["dataset_fields"]: + if field["field_name"] == key: + return field + + def _schema_resource_field(self, key): + """ + Returns the schema field information if the provided key exists as a field in + the resources fields of the dataset schema (if one was provided) + """ + if not self._dataset_schema: + return None + + for field in self._dataset_schema["resource_fields"]: + if field["field_name"] == key: + return field + + def _set_dataset_value(self, dataset_dict, key, value): + """ + Sets the value for a given key in a CKAN dataset dict + If a dataset schema was provided, the schema will be checked to see if + a custom field is present for the key. If so the key will be stored at + the dict root level, otherwise it will be stored as an extra. + Standard CKAN fields (defined in ROOT_DATASET_FIELDS) are always stored + at the root level. 
+        """
+        if self._schema_field(key) or key in ROOT_DATASET_FIELDS:
+            dataset_dict[key] = value
+        else:
+            if not dataset_dict.get("extras"):
+                dataset_dict["extras"] = []
+            dataset_dict["extras"].append({"key": key, "value": value})
+
+        return dataset_dict
+
+    def _set_list_dataset_value(self, dataset_dict, key, value):
+        schema_field = self._schema_field(key)
+        if schema_field and "scheming_multiple_text" in schema_field["validators"]:
+            return self._set_dataset_value(dataset_dict, key, value)
+        else:
+            return self._set_dataset_value(dataset_dict, key, json.dumps(value))
+
+    def _set_list_resource_value(self, resource_dict, key, value):
+        schema_field = self._schema_resource_field(key)
+        if schema_field and "scheming_multiple_text" in schema_field["validators"]:
+            resource_dict[key] = value
+        else:
+            resource_dict[key] = json.dumps(value)
+
+        return resource_dict
+
     def _get_dataset_value(self, dataset_dict, key, default=None):
         """
         Returns the value for the given key on a CKAN dict
@@ -844,22 +935,31 @@ def _add_date_triple(self, subject, predicate, value, _type=Literal):
         """
         Adds a new triple with a date object
 
-        Dates are parsed using dateutil, and if the date obtained is correct,
-        added to the graph as an XSD.dateTime value.
+        If the value is one of xsd:gYear, xsd:gYearMonth or xsd:date, it is
+        added with the corresponding datatype. If not, the value will be parsed
+        using dateutil, and if the date obtained is correct, it is added to the
+        graph as an xsd:dateTime value.
 
         If there are parsing errors, the literal string value is added.
         """
         if not value:
             return
 
-        try:
-            default_datetime = datetime.datetime(1, 1, 1, 0, 0, 0)
-            _date = parse_date(value, default=default_datetime)
-            self.g.add(
-                (subject, predicate, _type(_date.isoformat(), datatype=XSD.dateTime))
-            )
-        except ValueError:
-            self.g.add((subject, predicate, _type(value)))
+        if is_year(value):
+            self.g.add((subject, predicate, _type(value, datatype=XSD.gYear)))
+        elif is_year_month(value):
+            self.g.add((subject, predicate, _type(value, datatype=XSD.gYearMonth)))
+        elif is_date(value):
+            self.g.add((subject, predicate, _type(value, datatype=XSD.date)))
+        else:
+            try:
+                default_datetime = datetime.datetime(1, 1, 1, 0, 0, 0)
+                _date = parse_date(value, default=default_datetime)
+
+                self.g.add(
+                    (subject, predicate, _type(_date.isoformat(), datatype=XSD.dateTime))
+                )
+            except ValueError:
+                self.g.add((subject, predicate, _type(value)))
 
     def _last_catalog_modification(self):
         """
@@ -898,7 +998,7 @@ def _without_mailto(self, mail_addr):
         Ensures that the mail address string has no mailto: prefix.
""" if mail_addr: - return str(mail_addr).replace(PREFIX_MAILTO, u"") + return str(mail_addr).replace(PREFIX_MAILTO, "") else: return mail_addr diff --git a/ckanext/dcat/profiles/euro_dcat_ap.py b/ckanext/dcat/profiles/euro_dcat_ap.py index 9a4c853b..b7e4cae4 100644 --- a/ckanext/dcat/profiles/euro_dcat_ap.py +++ b/ckanext/dcat/profiles/euro_dcat_ap.py @@ -20,11 +20,9 @@ DCAT, DCT, ADMS, - XSD, VCARD, FOAF, SCHEMA, - SKOS, LOCN, GSP, OWL, @@ -354,51 +352,66 @@ def graph_from_dataset(self, dataset_dict, dataset_ref): ) # Publisher - if any( + publisher_ref = None + + if dataset_dict.get("publisher"): + # Scheming publisher field: will be handled in a separate profile + pass + elif any( [ self._get_dataset_value(dataset_dict, "publisher_uri"), self._get_dataset_value(dataset_dict, "publisher_name"), - dataset_dict.get("organization"), ] ): - + # Legacy publisher_* extras publisher_uri = self._get_dataset_value(dataset_dict, "publisher_uri") - publisher_uri_fallback = publisher_uri_organization_fallback(dataset_dict) publisher_name = self._get_dataset_value(dataset_dict, "publisher_name") if publisher_uri: - publisher_details = CleanedURIRef(publisher_uri) - elif not publisher_name and publisher_uri_fallback: - # neither URI nor name are available, use organization as fallback - publisher_details = CleanedURIRef(publisher_uri_fallback) + publisher_ref = CleanedURIRef(publisher_uri) else: # No publisher_uri - publisher_details = BNode() - - g.add((publisher_details, RDF.type, FOAF.Organization)) - g.add((dataset_ref, DCT.publisher, publisher_details)) - - # In case no name and URI are available, again fall back to organization. - # If no name but an URI is available, the name literal remains empty to - # avoid mixing organization and dataset values. - if ( - not publisher_name - and not publisher_uri - and dataset_dict.get("organization") - ): - publisher_name = dataset_dict["organization"]["title"] - - g.add((publisher_details, FOAF.name, Literal(publisher_name))) - # TODO: It would make sense to fallback these to organization - # fields but they are not in the default schema and the - # `organization` object in the dataset_dict does not include - # custom fields + publisher_ref = BNode() + publisher_details = { + "name": publisher_name, + "email": self._get_dataset_value(dataset_dict, "publisher_email"), + "url": self._get_dataset_value(dataset_dict, "publisher_url"), + "type": self._get_dataset_value(dataset_dict, "publisher_type"), + } + elif dataset_dict.get("organization"): + # Fall back to dataset org + org_id = dataset_dict["organization"]["id"] + org_dict = None + if org_id in self._org_cache: + org_dict = self._org_cache[org_id] + else: + try: + org_dict = toolkit.get_action("organization_show")( + {"ignore_auth": True}, {"id": org_id} + ) + self._org_cache[org_id] = org_dict + except toolkit.ObjectNotFound: + pass + if org_dict: + publisher_ref = CleanedURIRef( + publisher_uri_organization_fallback(dataset_dict) + ) + publisher_details = { + "name": org_dict.get("title"), + "email": org_dict.get("email"), + "url": org_dict.get("url"), + "type": org_dict.get("dcat_type"), + } + # Add to graph + if publisher_ref: + g.add((publisher_ref, RDF.type, FOAF.Organization)) + g.add((dataset_ref, DCT.publisher, publisher_ref)) items = [ - ("publisher_email", FOAF.mbox, None, Literal), - ("publisher_url", FOAF.homepage, None, URIRef), - ("publisher_type", DCT.type, None, URIRefOrLiteral), + ("name", FOAF.name, None, Literal), + ("email", FOAF.mbox, None, Literal), + ("url", FOAF.homepage, None, 
URIRef), + ("type", DCT.type, None, URIRefOrLiteral), ] - - self._add_triples_from_dict(dataset_dict, publisher_details, items) + self._add_triples_from_dict(publisher_details, publisher_ref, items) # Temporal start = self._get_dataset_value(dataset_dict, "temporal_start") diff --git a/ckanext/dcat/profiles/euro_dcat_ap_2.py b/ckanext/dcat/profiles/euro_dcat_ap_2.py index 5e699303..c1f9274f 100644 --- a/ckanext/dcat/profiles/euro_dcat_ap_2.py +++ b/ckanext/dcat/profiles/euro_dcat_ap_2.py @@ -91,56 +91,52 @@ def parse_dataset(self, dataset_dict, dataset_ref): if values: resource_dict[key] = json.dumps(values) - # Access services - access_service_list = [] + # Access services + access_service_list = [] - for access_service in self.g.objects( - distribution, DCAT.accessService + for access_service in self.g.objects( + distribution, DCAT.accessService + ): + access_service_dict = {} + + # Simple values + for key, predicate in ( + ("availability", DCATAP.availability), + ("title", DCT.title), + ("endpoint_description", DCAT.endpointDescription), + ("license", DCT.license), + ("access_rights", DCT.accessRights), + ("description", DCT.description), + ): + value = self._object_value(access_service, predicate) + if value: + access_service_dict[key] = value + # List + for key, predicate in ( + ("endpoint_url", DCAT.endpointURL), + ("serves_dataset", DCAT.servesDataset), ): - access_service_dict = {} - - # Simple values - for key, predicate in ( - ("availability", DCATAP.availability), - ("title", DCT.title), - ("endpoint_description", DCAT.endpointDescription), - ("license", DCT.license), - ("access_rights", DCT.accessRights), - ("description", DCT.description), - ): - value = self._object_value(access_service, predicate) - if value: - access_service_dict[key] = value - # List - for key, predicate in ( - ("endpoint_url", DCAT.endpointURL), - ("serves_dataset", DCAT.servesDataset), - ): - values = self._object_value_list( - access_service, predicate - ) - if values: - access_service_dict[key] = values - - # Access service URI (explicitly show the missing ones) - access_service_dict["uri"] = ( - str(access_service) - if isinstance(access_service, URIRef) - else "" - ) - - # Remember the (internal) access service reference for referencing in - # further profiles, e.g. for adding more properties - access_service_dict["access_service_ref"] = str( - access_service - ) - - access_service_list.append(access_service_dict) - - if access_service_list: - resource_dict["access_services"] = json.dumps( - access_service_list - ) + values = self._object_value_list(access_service, predicate) + if values: + access_service_dict[key] = values + + # Access service URI (explicitly show the missing ones) + access_service_dict["uri"] = ( + str(access_service) + if isinstance(access_service, URIRef) + else "" + ) + + # Remember the (internal) access service reference for referencing in + # further profiles, e.g. 
for adding more properties + access_service_dict["access_service_ref"] = str(access_service) + + access_service_list.append(access_service_dict) + + if access_service_list: + resource_dict["access_services"] = json.dumps( + access_service_list + ) return dataset_dict @@ -253,60 +249,54 @@ def graph_from_dataset(self, dataset_dict, dataset_ref): ] self._add_list_triples_from_dict(resource_dict, distribution, items) - try: - access_service_list = json.loads( - resource_dict.get("access_services", "[]") + # Access services + access_service_list = resource_dict.get("access_services", []) + if isinstance(access_service_list, str): + try: + access_service_list = json.loads(access_service_list) + except ValueError: + access_service_list = [] + + for access_service_dict in access_service_list: + + access_service_uri = access_service_dict.get("uri") + if access_service_uri: + access_service_node = CleanedURIRef(access_service_uri) + else: + access_service_node = BNode() + # Remember the (internal) access service reference for referencing in + # further profiles + access_service_dict["access_service_ref"] = str(access_service_node) + + self.g.add((distribution, DCAT.accessService, access_service_node)) + + self.g.add((access_service_node, RDF.type, DCAT.DataService)) + + # Simple values + items = [ + ("availability", DCATAP.availability, None, URIRefOrLiteral), + ("license", DCT.license, None, URIRefOrLiteral), + ("access_rights", DCT.accessRights, None, URIRefOrLiteral), + ("title", DCT.title, None, Literal), + ("endpoint_description", DCAT.endpointDescription, None, Literal), + ("description", DCT.description, None, Literal), + ] + + self._add_triples_from_dict( + access_service_dict, access_service_node, items ) - # Access service - for access_service_dict in access_service_list: - - access_service_uri = access_service_dict.get("uri") - if access_service_uri: - access_service_node = CleanedURIRef(access_service_uri) - else: - access_service_node = BNode() - # Remember the (internal) access service reference for referencing in - # further profiles - access_service_dict["access_service_ref"] = str( - access_service_node - ) - - self.g.add((distribution, DCAT.accessService, access_service_node)) - - self.g.add((access_service_node, RDF.type, DCAT.DataService)) - - # Simple values - items = [ - ("availability", DCATAP.availability, None, URIRefOrLiteral), - ("license", DCT.license, None, URIRefOrLiteral), - ("access_rights", DCT.accessRights, None, URIRefOrLiteral), - ("title", DCT.title, None, Literal), - ( - "endpoint_description", - DCAT.endpointDescription, - None, - Literal, - ), - ("description", DCT.description, None, Literal), - ] - - self._add_triples_from_dict( - access_service_dict, access_service_node, items - ) - # Lists - items = [ - ("endpoint_url", DCAT.endpointURL, None, URIRefOrLiteral), - ("serves_dataset", DCAT.servesDataset, None, URIRefOrLiteral), - ] - self._add_list_triples_from_dict( - access_service_dict, access_service_node, items - ) + # Lists + items = [ + ("endpoint_url", DCAT.endpointURL, None, URIRefOrLiteral), + ("serves_dataset", DCAT.servesDataset, None, URIRefOrLiteral), + ] + self._add_list_triples_from_dict( + access_service_dict, access_service_node, items + ) - if access_service_list: - resource_dict["access_services"] = json.dumps(access_service_list) - except ValueError: - pass + if access_service_list: + resource_dict["access_services"] = json.dumps(access_service_list) def graph_from_catalog(self, catalog_dict, catalog_ref): diff --git 
a/ckanext/dcat/profiles/euro_dcat_ap_scheming.py b/ckanext/dcat/profiles/euro_dcat_ap_scheming.py
new file mode 100644
index 00000000..12eb540e
--- /dev/null
+++ b/ckanext/dcat/profiles/euro_dcat_ap_scheming.py
@@ -0,0 +1,220 @@
+import json
+
+from rdflib import URIRef, BNode, Literal
+from .base import RDFProfile, CleanedURIRef, URIRefOrLiteral
+from .base import (
+    RDF,
+    XSD,
+    DCAT,
+    DCT,
+    VCARD,
+    FOAF,
+    SCHEMA,
+    SKOS,
+    LOCN,
+)
+
+
+class EuropeanDCATAPSchemingProfile(RDFProfile):
+    """
+    This is a compatibility profile meant to add support for ckanext-scheming to the existing
+    `euro_dcat_ap` and `euro_dcat_ap_2` profiles.
+    It does not add or remove any properties from these profiles, it just transforms the
+    resulting dataset_dict so it is compatible with a ckanext-scheming schema
+    """
+
+    def parse_dataset(self, dataset_dict, dataset_ref):
+        """
+        Modify the dataset_dict generated by the euro_dcat_ap and euro_dcat_ap_2 profiles
+        to make it compatible with the scheming file definitions:
+            * Move extras to root level fields
+            * Parse lists (multiple text preset)
+            * Turn namespaced extras into repeating subfields
+        """
+
+        if not self._dataset_schema:
+            # Not using scheming
+            return dataset_dict
+
+        # Move extras to root
+
+        extras_to_remove = []
+        extras = dataset_dict.get("extras", [])
+        for extra in extras:
+            if self._schema_field(extra["key"]):
+                # This is a field defined in the dataset schema
+                dataset_dict[extra["key"]] = extra["value"]
+                extras_to_remove.append(extra["key"])
+
+        dataset_dict["extras"] = [e for e in extras if e["key"] not in extras_to_remove]
+
+        # Parse lists
+        def _parse_list_value(data_dict, field_name):
+            schema_field = self._schema_field(
+                field_name
+            ) or self._schema_resource_field(field_name)
+
+            if schema_field and "scheming_multiple_text" in schema_field.get(
+                "validators", []
+            ):
+                if isinstance(data_dict[field_name], str):
+                    try:
+                        data_dict[field_name] = json.loads(data_dict[field_name])
+                    except ValueError:
+                        pass
+
+        for field_name in dataset_dict.keys():
+            _parse_list_value(dataset_dict, field_name)
+
+        for resource_dict in dataset_dict.get("resources", []):
+            for field_name in resource_dict.keys():
+                _parse_list_value(resource_dict, field_name)
+
+        # Repeating subfields
+        new_fields_mapping = {
+            "temporal_coverage": "temporal"
+        }
+        for schema_field in self._dataset_schema["dataset_fields"]:
+            if "repeating_subfields" in schema_field:
+                # Check if existing extras need to be migrated
+                field_name = schema_field["field_name"]
+                new_extras = []
+                new_dict = {}
+                check_name = new_fields_mapping.get(field_name, field_name)
+                for extra in dataset_dict.get("extras", []):
+                    if extra["key"].startswith(f"{check_name}_"):
+                        subfield = extra["key"][extra["key"].index("_") + 1 :]
+                        if subfield in [
+                            f["field_name"] for f in schema_field["repeating_subfields"]
+                        ]:
+                            new_dict[subfield] = extra["value"]
+                        else:
+                            new_extras.append(extra)
+                    else:
+                        new_extras.append(extra)
+                if new_dict:
+                    dataset_dict[field_name] = [new_dict]
+                    dataset_dict["extras"] = new_extras
+
+        # Repeating subfields: resources
+        for schema_field in self._dataset_schema["resource_fields"]:
+            if "repeating_subfields" in schema_field:
+                # Check if the value needs to be loaded from JSON
+                field_name = schema_field["field_name"]
+                for resource_dict in dataset_dict.get("resources", []):
+                    if resource_dict.get(field_name) and isinstance(
+                        resource_dict[field_name], str
+                    ):
+                        try:
+                            # TODO: load only subfields in schema?
+ resource_dict[field_name] = json.loads( + resource_dict[field_name] + ) + except ValueError: + pass + + return dataset_dict + + def graph_from_dataset(self, dataset_dict, dataset_ref): + """ + Add triples to the graph from new repeating subfields + """ + + def _not_empty_dict(data_dict): + return any(data_dict.values()) + + contact = dataset_dict.get("contact") + if isinstance(contact, list) and len(contact) and _not_empty_dict(contact[0]): + for item in contact: + contact_uri = item.get("uri") + if contact_uri: + contact_details = CleanedURIRef(contact_uri) + else: + contact_details = BNode() + + self.g.add((contact_details, RDF.type, VCARD.Organization)) + self.g.add((dataset_ref, DCAT.contactPoint, contact_details)) + + self._add_triple_from_dict(item, contact_details, VCARD.fn, "name") + # Add mail address as URIRef, and ensure it has a mailto: prefix + self._add_triple_from_dict( + item, + contact_details, + VCARD.hasEmail, + "email", + _type=URIRef, + value_modifier=self._add_mailto, + ) + + publisher = dataset_dict.get("publisher") + if isinstance(publisher, list) and len(publisher) and _not_empty_dict(publisher[0]): + publisher = publisher[0] + publisher_uri = publisher.get("uri") + if publisher_uri: + publisher_ref = CleanedURIRef(publisher_uri) + else: + publisher_ref = BNode() + + self.g.add((publisher_ref, RDF.type, FOAF.Organization)) + self.g.add((dataset_ref, DCT.publisher, publisher_ref)) + + self._add_triple_from_dict(publisher, publisher_ref, FOAF.name, "name") + self._add_triple_from_dict( + publisher, publisher_ref, FOAF.homepage, "url", _type=URIRef + ) + self._add_triple_from_dict( + publisher, publisher_ref, DCT.type, "type", _type=URIRefOrLiteral + ) + self._add_triple_from_dict( + publisher, + publisher_ref, + VCARD.hasEmail, + "email", + _type=URIRef, + value_modifier=self._add_mailto, + ) + + temporal = dataset_dict.get("temporal_coverage") + if isinstance(temporal, list) and len(temporal) and _not_empty_dict(temporal[0]): + for item in temporal: + temporal_ref = BNode() + self.g.add((temporal_ref, RDF.type, DCT.PeriodOfTime)) + if item.get("start"): + self._add_date_triple(temporal_ref, SCHEMA.startDate, item["start"]) + if item.get("end"): + self._add_date_triple(temporal_ref, SCHEMA.endDate, item["end"]) + self.g.add((dataset_ref, DCT.temporal, temporal_ref)) + + spatial = dataset_dict.get("spatial_coverage") + if isinstance(spatial, list) and len(spatial) and _not_empty_dict(spatial[0]): + for item in spatial: + if item.get("uri"): + spatial_ref = CleanedURIRef(item["uri"]) + else: + spatial_ref = BNode() + self.g.add((spatial_ref, RDF.type, DCT.Location)) + self.g.add((dataset_ref, DCT.spatial, spatial_ref)) + + if item.get("text"): + self.g.add((spatial_ref, SKOS.prefLabel, Literal(item["text"]))) + + for field in [ + ("geom", LOCN.geometry), + ("bbox", DCAT.bbox), + ("centroid", DCAT.centroid), + ]: + if item.get(field[0]): + self._add_spatial_value_to_graph( + spatial_ref, field[1], item[field[0]] + ) + + resources = dataset_dict.get("resources", []) + for resource in resources: + if resource.get("access_services"): + if isinstance(resource["access_services"], str): + try: + resource["access_services"] = json.loads( + resource["access_services"] + ) + except ValueError: + pass diff --git a/ckanext/dcat/schemas/__init__.py b/ckanext/dcat/schemas/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/ckanext/dcat/schemas/dcat_ap_2.1_full.yaml b/ckanext/dcat/schemas/dcat_ap_2.1_full.yaml new file mode 100644 index 00000000..d9532011 --- 
/dev/null
+++ b/ckanext/dcat/schemas/dcat_ap_2.1_full.yaml
@@ -0,0 +1,379 @@
+scheming_version: 2
+dataset_type: dataset
+about: Full DCAT AP 2.1 schema
+about_url: http://github.com/ckan/ckanext-dcat
+
+dataset_fields:
+
+- field_name: title
+  label: Title
+  preset: title
+  required: true
+  help_text: A descriptive title for the dataset.
+
+- field_name: name
+  label: URL
+  preset: dataset_slug
+  form_placeholder: eg. my-dataset
+
+- field_name: notes
+  label: Description
+  required: true
+  form_snippet: markdown.html
+  help_text: A free-text account of the dataset.
+
+- field_name: tag_string
+  label: Keywords
+  preset: tag_string_autocomplete
+  form_placeholder: eg. economy, mental health, government
+  help_text: Keywords or tags describing the dataset. Use commas to separate multiple values.
+
+- field_name: contact
+  label: Contact points
+  repeating_label: Contact point
+  repeating_subfields:
+
+  - field_name: uri
+    label: URI
+
+  - field_name: name
+    label: Name
+
+  - field_name: email
+    label: Email
+    display_snippet: email.html
+  help_text: Contact information for enquiries about the dataset.
+
+- field_name: publisher
+  label: Publisher
+  repeating_label: Publisher
+  repeating_once: true
+  repeating_subfields:
+
+  - field_name: uri
+    label: URI
+
+  - field_name: name
+    label: Name
+
+  - field_name: email
+    label: Email
+    display_snippet: email.html
+
+  - field_name: url
+    label: URL
+    display_snippet: link.html
+
+  - field_name: type
+    label: Type
+  help_text: Entity responsible for making the dataset available.
+
+- field_name: license_id
+  label: License
+  form_snippet: license.html
+  help_text: License definitions and additional information can be found at http://opendefinition.org/.
+
+- field_name: owner_org
+  label: Organization
+  preset: dataset_organization
+  help_text: The CKAN organization the dataset belongs to.
+
+- field_name: url
+  label: Landing page
+  form_placeholder: http://example.com/dataset.json
+  display_snippet: link.html
+  help_text: Web page that can be navigated to gain access to the dataset, its distributions and/or additional information.
+
+  # Note: this will fall back to metadata_created if not present
+- field_name: issued
+  label: Release date
+  preset: dcat_date
+  help_text: Date of publication of the dataset.
+
+  # Note: this will fall back to metadata_modified if not present
+- field_name: modified
+  label: Modification date
+  preset: dcat_date
+  help_text: Most recent date on which the dataset was changed, updated or modified.
+
+- field_name: version
+  label: Version
+  validators: ignore_missing unicode_safe package_version_validator
+  help_text: Version number or other version designation of the dataset.
+
+- field_name: version_notes
+  label: Version notes
+  validators: ignore_missing unicode_safe
+  form_snippet: markdown.html
+  display_snippet: markdown.html
+  help_text: A description of the differences between this version and a previous version of the dataset.
+
+  # Note: CKAN will generate a unique identifier for each dataset
+- field_name: identifier
+  label: Identifier
+  help_text: A unique identifier of the dataset.
+
+- field_name: frequency
+  label: Frequency
+  help_text: The frequency at which the dataset is published.
+
+- field_name: provenance
+  label: Provenance
+  form_snippet: markdown.html
+  display_snippet: markdown.html
+  help_text: A statement about the lineage of the dataset.
+
+- field_name: dcat_type
+  label: Type
+  help_text: The type of the dataset.
+  # TODO: controlled vocabulary?
+ +- field_name: temporal_coverage + label: Temporal coverage + repeating_subfields: + + - field_name: start + label: Start + preset: dcat_date + + - field_name: end + label: End + preset: dcat_date + help_text: The temporal period or periods the dataset covers. + +- field_name: temporal_resolution + label: Temporal resolution + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: Minimum time period resolvable in the dataset. + +- field_name: spatial_coverage + label: Spatial coverage + repeating_subfields: + + - field_name: uri + label: URI + + - field_name: text + label: Label + + - field_name: geom + label: Geometry + + - field_name: bbox + label: Bounding Box + + - field_name: centroid + label: Centroid + help_text: A geographic region that is covered by the dataset. + +- field_name: spatial_resolution_in_meters + label: Spatial resolution in meters + preset: multiple_text + validators: ignore_missing scheming_multiple_number + help_text: Minimum spatial separation resolvable in a dataset, measured in meters. + +- field_name: access_rights + label: Access rights + validators: ignore_missing unicode_safe + form_snippet: markdown.html + display_snippet: markdown.html + help_text: Information that indicates whether the dataset is Open Data, has access restrictions or is not public. + +- field_name: alternate_identifier + label: Other identifier + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: This property refers to a secondary identifier of the dataset, such as MAST/ADS, DataCite, DOI, etc. + +- field_name: theme + label: Theme + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: A category of the dataset. A Dataset may be associated with multiple themes. + +- field_name: language + label: Language + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: Language or languages of the dataset. + # TODO: language form snippet / validator / graph + +- field_name: documentation + label: Documentation + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: A page or document about this dataset. + +- field_name: conforms_to + label: Conforms to + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: An implementing rule or other specification that the dataset follows. + +- field_name: is_referenced_by + label: Is referenced by + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: A related resource, such as a publication, that references, cites, or otherwise points to the dataset. + +- field_name: applicable_legislation + label: Applicable legislation + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: The legislation that mandates the creation or management of the dataset. + +#- field_name: hvd_category +# label: HVD Category +# preset: multiple_text +# validators: ignore_missing scheming_multiple_text +# TODO: implement separately as part of wider HVD support + +# Note: if not provided, this will be autogenerated +- field_name: uri + label: URI + help_text: A URI for this dataset (if not provided, it will be autogenerated). + +# TODO: relation-based properties are not yet included (e.g. is_version_of, source, sample, etc) +# +resource_fields: + +- field_name: url + label: URL + preset: resource_url_upload + +- field_name: name + label: Name + form_placeholder: + help_text: A descriptive title for the resource.
+ +- field_name: description + label: Description + form_snippet: markdown.html + help_text: A free-text account of the resource. + +- field_name: format + label: Format + preset: resource_format_autocomplete + help_text: File format. If not provided, it will be guessed. + +- field_name: mimetype + label: Media type + validators: if_empty_guess_format ignore_missing unicode_safe + help_text: Media type for this format. If not provided, it will be guessed. + +- field_name: compress_format + label: Compress format + help_text: The format of the file in which the data is contained in a compressed form. + +- field_name: package_format + label: Package format + help_text: The format of the file in which one or more data files are grouped together. + +- field_name: size + label: Size + validators: ignore_missing int_validator + form_snippet: number.html + display_snippet: file_size.html + help_text: File size in bytes. + +- field_name: hash + label: Hash + help_text: Checksum of the downloaded file. + +- field_name: hash_algorithm + label: Hash Algorithm + help_text: Algorithm used to calculate the checksum. + +- field_name: rights + label: Rights + form_snippet: markdown.html + display_snippet: markdown.html + help_text: A statement about the rights associated with the resource. + +- field_name: availability + label: Availability + help_text: Indicates how long it is planned to keep the resource available. + +- field_name: status + label: Status + preset: select + choices: + - value: http://purl.org/adms/status/Completed + label: Completed + - value: http://purl.org/adms/status/UnderDevelopment + label: Under Development + - value: http://purl.org/adms/status/Deprecated + label: Deprecated + - value: http://purl.org/adms/status/Withdrawn + label: Withdrawn + help_text: The status of the resource in the context of the maturity lifecycle. + +- field_name: license + label: License + help_text: License under which the resource is made available. If not provided, it will be inherited from the dataset. + + # Note: this falls back to the standard resource url field +- field_name: access_url + label: Access URL + help_text: URL that gives access to the dataset (defaults to the standard resource URL). + + # Note: this falls back to the standard resource url field +- field_name: download_url + label: Download URL + help_text: URL that provides a direct link to a downloadable file (defaults to the standard resource URL). + +- field_name: issued + label: Release date + preset: dcat_date + help_text: Date of publication of the resource. + +- field_name: modified + label: Modification date + preset: dcat_date + help_text: Most recent date on which the resource was changed, updated or modified. + +- field_name: language + label: Language + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: Language or languages of the resource. + +- field_name: documentation + label: Documentation + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: A page or document about this resource. + +- field_name: conforms_to + label: Conforms to + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: An established schema to which the described resource conforms. + +- field_name: applicable_legislation + label: Applicable legislation + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: The legislation that mandates the creation or management of the resource.
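+
+# Note: fields using the `dcat_date` preset (`issued` and `modified` above)
+# accept YYYY, YYYY-MM, YYYY-MM-DD and YYYY-MM-DDTHH:MM:SS values, and the
+# serializer types them accordingly in the RDF output. Illustrative pairs,
+# taken from the tests added in this PR:
+#
+#   "2024"                -> xsd:gYear
+#   "2024-05"             -> xsd:gYearMonth
+#   "2024-05-31"          -> xsd:date
+#   "2024-05-31T12:30:01" -> xsd:dateTime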
+ +- field_name: access_services + label: Access services + repeating_label: Access service + repeating_subfields: + + - field_name: uri + label: URI + + - field_name: title + label: Title + + - field_name: endpoint_url + label: Endpoint URL + preset: multiple_text + help_text: A data service that gives access to the resource. + + # Note: if not provided, this will be autogenerated +- field_name: uri + label: URI + help_text: A URI for this resource (if not provided, it will be autogenerated). diff --git a/ckanext/dcat/schemas/dcat_ap_2.1_recommended.yaml b/ckanext/dcat/schemas/dcat_ap_2.1_recommended.yaml new file mode 100644 index 00000000..ed386d67 --- /dev/null +++ b/ckanext/dcat/schemas/dcat_ap_2.1_recommended.yaml @@ -0,0 +1,147 @@ +scheming_version: 2 +dataset_type: dataset +about: Recommended fields for DCAT AP 2.1 schema +about_url: http://github.com/ckan/ckanext-dcat + +dataset_fields: + +- field_name: title + label: Title + preset: title + required: true + help_text: A descriptive title for the dataset. + +- field_name: name + label: URL + preset: dataset_slug + form_placeholder: eg. my-dataset + +- field_name: notes + label: Description + required: true + form_snippet: markdown.html + help_text: A free-text account of the dataset. + +- field_name: tag_string + label: Keywords + preset: tag_string_autocomplete + form_placeholder: eg. economy, mental health, government + help_text: Keywords or tags describing the dataset. Use commas to separate multiple values. + +- field_name: contact + label: Contact points + repeating_label: Contact point + repeating_subfields: + + - field_name: uri + label: URI + + - field_name: name + label: Name + + - field_name: email + label: Email + display_snippet: email.html + help_text: Contact information for enquiries about the dataset. + +- field_name: publisher + label: Publisher + repeating_label: Publisher + repeating_once: true + repeating_subfields: + + - field_name: uri + label: URI + + - field_name: name + label: Name + + - field_name: email + label: Email + display_snippet: email.html + + - field_name: url + label: URL + display_snippet: link.html + + - field_name: type + label: Type + help_text: Entity responsible for making the dataset available. + +- field_name: license_id + label: License + form_snippet: license.html + help_text: License definitions and additional information can be found at http://opendefinition.org/. + +- field_name: owner_org + label: Organization + preset: dataset_organization + help_text: The CKAN organization the dataset belongs to. + +- field_name: temporal_coverage + label: Temporal coverage + repeating_subfields: + + - field_name: start + label: Start + preset: dcat_date + + - field_name: end + label: End + preset: dcat_date + help_text: The temporal period or periods the dataset covers. + +- field_name: spatial_coverage + label: Spatial coverage + repeating_subfields: + + - field_name: uri + label: URI + + - field_name: text + label: Label + + - field_name: geom + label: Geometry + + - field_name: bbox + label: Bounding Box + + - field_name: centroid + label: Centroid + help_text: A geographic region that is covered by the dataset. + +- field_name: theme + label: Theme + preset: multiple_text + validators: ignore_missing scheming_multiple_text + help_text: A category of the dataset. A Dataset may be associated with multiple themes.
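+
+# A minimal configuration sketch for enabling this schema, mirroring the
+# settings used in the tests added in this PR (swap in dcat_ap_2.1_full.yaml
+# for the full schema):
+#
+#   ckan.plugins = dcat scheming_datasets
+#   scheming.dataset_schemas = ckanext.dcat.schemas:dcat_ap_2.1_recommended.yaml
+#   scheming.presets = ckanext.scheming:presets.json ckanext.dcat.schemas:presets.yaml
+#   ckanext.dcat.rdf.profiles = euro_dcat_ap_2 euro_dcat_ap_scheming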
+ +resource_fields: + +- field_name: url + label: URL + preset: resource_url_upload + +- field_name: name + label: Name + form_placeholder: + help_text: A descriptive title for the resource. + +- field_name: description + label: Description + form_snippet: markdown.html + help_text: A free-text account of the resource. + +- field_name: format + label: Format + preset: resource_format_autocomplete + help_text: File format. If not provided, it will be guessed. + +- field_name: availability + label: Availability + help_text: Indicates how long it is planned to keep the resource available. + +- field_name: license + label: License + help_text: License under which the resource is made available. If not provided, it will be inherited from the dataset. diff --git a/ckanext/dcat/schemas/presets.yaml b/ckanext/dcat/schemas/presets.yaml new file mode 100644 index 00000000..88be7b0c --- /dev/null +++ b/ckanext/dcat/schemas/presets.yaml @@ -0,0 +1,12 @@ +scheming_presets_version: 1 +about: Presets for the ckanext-dcat extension +about_url: http://github.com/ckan/ckanext-dcat + +presets: + +- preset_name: dcat_date + values: + # Note: use datetime.html or datetime_tz.html if you want to include an input for time + form_snippet: date.html + display_snippet: dcat_date.html + validators: ignore_missing dcat_date convert_to_json_if_datetime diff --git a/ckanext/dcat/schemas/publisher_organization.yaml b/ckanext/dcat/schemas/publisher_organization.yaml new file mode 100644 index 00000000..3d1f7d3b --- /dev/null +++ b/ckanext/dcat/schemas/publisher_organization.yaml @@ -0,0 +1,35 @@ +scheming_version: 2 +about_url: http://github.com/ckan/ckanext-dcat +description: > + An organization schema that implements the properties supported + by default in the dct:publisher property of a dcat:Dataset + +fields: + +- field_name: title + label: Name + validators: ignore_missing unicode_safe + form_snippet: large_text.html + form_attrs: {data-module: slug-preview-target} + +- field_name: name + label: URL + validators: not_empty unicode_safe name_validator group_name_validator + form_snippet: slug.html + form_placeholder: my-organization + +- field_name: notes + label: Description + form_snippet: markdown.html + form_placeholder: A little information about this organization.
+ +- field_name: email + label: Email + display_snippet: email.html + +- field_name: url + label: URL + display_snippet: link.html + +- field_name: dcat_type + label: Type diff --git a/ckanext/dcat/templates/scheming/display_snippets/dcat_date.html b/ckanext/dcat/templates/scheming/display_snippets/dcat_date.html new file mode 100644 index 00000000..3e7f7ec6 --- /dev/null +++ b/ckanext/dcat/templates/scheming/display_snippets/dcat_date.html @@ -0,0 +1,4 @@ +{{ h.render_datetime(data[field.field_name]) }} + +{# Use the following if you want to include the time as well #} +{# h.render_datetime(data[field.field_name], with_hours=True) #} diff --git a/ckanext/dcat/templates/scheming/display_snippets/file_size.html b/ckanext/dcat/templates/scheming/display_snippets/file_size.html new file mode 100644 index 00000000..ca7e5057 --- /dev/null +++ b/ckanext/dcat/templates/scheming/display_snippets/file_size.html @@ -0,0 +1 @@ +{{ h.localised_filesize(data[field.field_name]) }} diff --git a/ckanext/dcat/templates/scheming/form_snippets/number.html b/ckanext/dcat/templates/scheming/form_snippets/number.html new file mode 100644 index 00000000..bed99336 --- /dev/null +++ b/ckanext/dcat/templates/scheming/form_snippets/number.html @@ -0,0 +1,16 @@ +{% import 'macros/form.html' as form %} +{% call form.input( + field.field_name, + id='field-' + field.field_name, + label=h.scheming_language_text(field.label), + placeholder=h.scheming_language_text(field.form_placeholder), + type='number', + value=data.get(field.field_name), + error=errors[field.field_name], + classes=field.classes if 'classes' in field else ['control-medium'], + attrs=dict({"class": "form-control"}, **(field.get('form_attrs', {}))), + is_required=h.scheming_field_required(field) + ) +%} + {%- snippet 'scheming/form_snippets/help_text.html', field=field -%} +{% endcall %} diff --git a/ckanext/dcat/templates/scheming/form_snippets/repeating_subfields.html b/ckanext/dcat/templates/scheming/form_snippets/repeating_subfields.html new file mode 100644 index 00000000..dec11f45 --- /dev/null +++ b/ckanext/dcat/templates/scheming/form_snippets/repeating_subfields.html @@ -0,0 +1,8 @@ +{% ckan_extends %} + +{% block add_button %} + {# Hide the Add button if we only want one set of subfields #} + {% if not field.repeating_once %} + {{ super() }} + {% endif %} +{% endblock %} diff --git a/ckanext/dcat/tests/test_euro_dcatap_2_profile_serialize.py b/ckanext/dcat/tests/test_euro_dcatap_2_profile_serialize.py index 114dc602..abf80363 100644 --- a/ckanext/dcat/tests/test_euro_dcatap_2_profile_serialize.py +++ b/ckanext/dcat/tests/test_euro_dcatap_2_profile_serialize.py @@ -298,13 +298,13 @@ def test_temporal(self): for predicate in [SCHEMA.startDate, DCAT.startDate]: triples = [] for temporal_obj in temporal_obj_list: - triples.extend(self._triples(g, temporal_obj, predicate, parse_date(extras['temporal_start']).isoformat(), XSD.dateTime)) + triples.extend(self._triples(g, temporal_obj, predicate, extras['temporal_start'], XSD.dateTime)) assert len(triples) == 1 for predicate in [SCHEMA.endDate, DCAT.endDate]: triples = [] for temporal_obj in temporal_obj_list: - triples.extend(self._triples(g, temporal_obj, predicate, parse_date(extras['temporal_end']).isoformat(), XSD.dateTime)) + triples.extend(self._triples(g, temporal_obj, predicate, extras['temporal_end'], XSD.date)) assert len(triples) == 1 def test_high_value_datasets(self): diff --git a/ckanext/dcat/tests/test_euro_dcatap_profile_serialize.py 
b/ckanext/dcat/tests/test_euro_dcatap_profile_serialize.py index dd43ef62..edec0c5a 100644 --- a/ckanext/dcat/tests/test_euro_dcatap_profile_serialize.py +++ b/ckanext/dcat/tests/test_euro_dcatap_profile_serialize.py @@ -1,6 +1,7 @@ from builtins import str from builtins import object import json +import uuid import pytest @@ -400,11 +401,17 @@ def test_publisher_extras(self): assert self._triple(g, publisher, DCT.type, URIRef(extras['publisher_type'])) def test_publisher_org(self): + org_id = str(uuid.uuid4()) + factories.Organization( + id=org_id, + name='publisher1', + title='Example Publisher from Org' + ) dataset = { 'id': '4b6fe9ca-dc77-4cec-92a4-55c6624a5bd6', 'name': 'test-dataset', 'organization': { - 'id': '', + 'id': org_id, 'name': 'publisher1', 'title': 'Example Publisher from Org', } @@ -496,8 +503,8 @@ def test_temporal(self): assert temporal assert self._triple(g, temporal, RDF.type, DCT.PeriodOfTime) - assert self._triple(g, temporal, SCHEMA.startDate, parse_date(extras['temporal_start']).isoformat(), XSD.dateTime) - assert self._triple(g, temporal, SCHEMA.endDate, parse_date(extras['temporal_end']).isoformat(), XSD.dateTime) + assert self._triple(g, temporal, SCHEMA.startDate, extras['temporal_start'], XSD.dateTime) + assert self._triple(g, temporal, SCHEMA.endDate, extras['temporal_end'], XSD.date) def test_spatial(self): dataset = { @@ -1121,6 +1128,30 @@ def test_hash_algorithm_not_uri(self): assert self._triple(g, checksum, SPDX.checksumValue, resource['hash'], data_type='http://www.w3.org/2001/XMLSchema#hexBinary') assert self._triple(g, checksum, SPDX.algorithm, resource['hash_algorithm']) + @pytest.mark.parametrize("value,data_type", [ + ("2024", XSD.gYear), + ("2024-05", XSD.gYearMonth), + ("2024-05-31", XSD.date), + ("2024-05-31T00:00:00", XSD.dateTime), + ("2024-05-31T12:30:01", XSD.dateTime), + ("2024-05-31T12:30:01.451243", XSD.dateTime), + ]) + def test_dates_data_types(self, value, data_type): + dataset = { + 'id': '4b6fe9ca-dc77-4cec-92a4-55c6624a5bd6', + 'name': 'test-dataset', + 'title': 'Test DCAT dataset', + 'issued': value, + } + + s = RDFSerializer(profiles=['euro_dcat_ap']) + g = s.g + + dataset_ref = s.graph_from_dataset(dataset) + + assert str(self._triple(g, dataset_ref, DCT.issued, None)[2]) == value + assert self._triple(g, dataset_ref, DCT.issued, None)[2].datatype == data_type + class TestEuroDCATAPProfileSerializeCatalog(BaseSerializeTest): diff --git a/ckanext/dcat/tests/test_scheming_support.py b/ckanext/dcat/tests/test_scheming_support.py new file mode 100644 index 00000000..d79523fd --- /dev/null +++ b/ckanext/dcat/tests/test_scheming_support.py @@ -0,0 +1,970 @@ +from unittest import mock +import json + +import pytest +from rdflib.namespace import RDF +from rdflib.term import URIRef +from geomet import wkt + +from ckan.tests import factories +from ckan.tests.helpers import call_action + +from ckanext.dcat import utils +from ckanext.dcat.processors import RDFSerializer, RDFParser +from ckanext.dcat.profiles import ( + DCAT, + DCATAP, + DCT, + ADMS, + XSD, + VCARD, + FOAF, + SCHEMA, + SKOS, + LOCN, + GSP, + OWL, + GEOJSON_IMT, + SPDX, +) +from ckanext.dcat.tests.utils import BaseSerializeTest, BaseParseTest + + +@pytest.mark.usefixtures("with_plugins", "clean_db") +@pytest.mark.ckan_config("ckan.plugins", "dcat scheming_datasets") +@pytest.mark.ckan_config( + "scheming.dataset_schemas", "ckanext.dcat.schemas:dcat_ap_2.1_full.yaml" +) +@pytest.mark.ckan_config( + "scheming.presets", + "ckanext.scheming:presets.json 
ckanext.dcat.schemas:presets.yaml", +) +@pytest.mark.ckan_config( + "ckanext.dcat.rdf.profiles", "euro_dcat_ap_2 euro_dcat_ap_scheming" +) +class TestSchemingSerializeSupport(BaseSerializeTest): + def test_e2e_ckan_to_dcat(self): + """ + Create a dataset using the scheming schema, check that fields + are exposed in the DCAT RDF graph + """ + + dataset_dict = { + # Core fields + "name": "test-dataset", + "title": "Test DCAT dataset", + "notes": "Lorem ipsum", + "url": "http://example.org/ds1", + "version": "1.0b", + "tags": [{"name": "Tag 1"}, {"name": "Tag 2"}], + # Standard fields + "issued": "2024-05-01", + "modified": "2024-05-05", + "identifier": "xx-some-dataset-id-yy", + "frequency": "monthly", + "provenance": "Statement about provenance", + "dcat_type": "test-type", + "version_notes": "Some version notes", + "access_rights": "Statement about access rights", + # List fields + "alternate_identifier": ["alt-id-1", "alt-id-2"], + "theme": [ + "https://example.org/uri/theme1", + "https://example.org/uri/theme2", + "https://example.org/uri/theme3", + ], + "language": ["en", "ca", "es"], + "documentation": ["https://example.org/some-doc.html"], + "conforms_to": ["Standard 1", "Standard 2"], + "is_referenced_by": [ + "https://doi.org/10.1038/sdata.2018.22", + "test_isreferencedby", + ], + "applicable_legislation": [ + "http://data.europa.eu/eli/reg_impl/2023/138/oj", + "http://data.europa.eu/eli/reg_impl/2023/138/oj_alt", + ], + # Repeating subfields + "contact": [ + {"name": "Contact 1", "email": "contact1@example.org"}, + {"name": "Contact 2", "email": "contact2@example.org"}, + ], + "publisher": [ + { + "name": "Test Publisher", + "email": "publisher@example.org", + "url": "https://example.org", + "type": "public_body", + }, + ], + "temporal_coverage": [ + {"start": "1905-03-01", "end": "2013-01-05"}, + {"start": "2024-04-10", "end": "2024-05-29"}, + ], + "temporal_resolution": ["PT15M", "P1D"], + "spatial_coverage": [ + { + "geom": { + "type": "Polygon", + "coordinates": [ + [ + [11.9936, 54.0486], + [11.9936, 54.2466], + [12.3045, 54.2466], + [12.3045, 54.0486], + [11.9936, 54.0486], + ] + ], + }, + "text": "Tarragona", + "uri": "https://sws.geonames.org/6361390/", + "bbox": { + "type": "Polygon", + "coordinates": [ + [ + [-2.1604, 42.7611], + [-2.0938, 42.7611], + [-2.0938, 42.7931], + [-2.1604, 42.7931], + [-2.1604, 42.7611], + ] + ], + }, + "centroid": {"type": "Point", "coordinates": [1.26639, 41.12386]}, + } + ], + "spatial_resolution_in_meters": [1.5, 2.0], + "resources": [ + { + "name": "Resource 1", + "description": "Some description", + "url": "https://example.com/data.csv", + "format": "CSV", + "availability": "http://publications.europa.eu/resource/authority/planned-availability/EXPERIMENTAL", + "compress_format": "http://www.iana.org/assignments/media-types/application/gzip", + "package_format": "http://publications.europa.eu/resource/authority/file-type/TAR", + "size": 12323, + "hash": "4304cf2e751e6053c90b1804c89c0ebb758f395a", + "hash_algorithm": "http://spdx.org/rdf/terms#checksumAlgorithm_sha1", + "status": "http://purl.org/adms/status/Completed", + "access_url": "https://example.com/data.csv", + "download_url": "https://example.com/data.csv", + "issued": "2024-05-01T01:20:33", + "modified": "2024-05-05T09:33:20", + "license": "http://creativecommons.org/licenses/by/3.0/", + "rights": "Some statement about rights", + "language": ["en", "ca", "es"], + "access_services": [ + { + "title": "Access Service 1", + "endpoint_url": [ +
"https://example.org/access_service/1", + "https://example.org/access_service/2", + ], + } + ], + } + ], + } + + dataset = call_action("package_create", **dataset_dict) + + # Make sure schema was used + assert dataset["conforms_to"][0] == "Standard 1" + assert dataset["contact"][0]["name"] == "Contact 1" + + s = RDFSerializer() + g = s.g + + dataset_ref = s.graph_from_dataset(dataset) + + assert str(dataset_ref) == utils.dataset_uri(dataset) + + # Core fields + assert self._triple(g, dataset_ref, RDF.type, DCAT.Dataset) + assert self._triple(g, dataset_ref, DCT.title, dataset["title"]) + assert self._triple(g, dataset_ref, DCT.description, dataset["notes"]) + assert self._triple(g, dataset_ref, OWL.versionInfo, dataset["version"]) + + # Standard fields + assert self._triple(g, dataset_ref, DCT.identifier, dataset["identifier"]) + assert self._triple( + g, dataset_ref, DCT.accrualPeriodicity, dataset["frequency"] + ) + assert self._triple(g, dataset_ref, DCT.provenance, dataset["provenance"]) + assert self._triple(g, dataset_ref, DCT.type, dataset["dcat_type"]) + assert self._triple(g, dataset_ref, ADMS.versionNotes, dataset["version_notes"]) + assert self._triple(g, dataset_ref, DCT.accessRights, dataset["access_rights"]) + + # Dates + assert self._triple( + g, + dataset_ref, + DCT.issued, + dataset["issued"], + data_type=XSD.date, + ) + assert self._triple( + g, + dataset_ref, + DCT.modified, + dataset["modified"], + data_type=XSD.date, + ) + + # List fields + + assert ( + self._triples_list_values(g, dataset_ref, DCT.conformsTo) + == dataset["conforms_to"] + ) + assert ( + self._triples_list_values(g, dataset_ref, ADMS.identifier) + == dataset["alternate_identifier"] + ) + assert self._triples_list_values(g, dataset_ref, DCAT.theme) == dataset["theme"] + assert ( + self._triples_list_values(g, dataset_ref, DCT.language) + == dataset["language"] + ) + assert ( + self._triples_list_values(g, dataset_ref, FOAF.page) + == dataset["documentation"] + ) + assert ( + self._triples_list_values(g, dataset_ref, DCAT.temporalResolution) + == dataset["temporal_resolution"] + ) + assert ( + self._triples_list_values(g, dataset_ref, DCT.isReferencedBy) + == dataset["is_referenced_by"] + ) + assert ( + self._triples_list_values(g, dataset_ref, DCATAP.applicableLegislation) + == dataset["applicable_legislation"] + ) + + assert ( + self._triples_list_python_values( + g, dataset_ref, DCAT.spatialResolutionInMeters + ) + == dataset["spatial_resolution_in_meters"] + ) + + # Repeating subfields + + contact_details = [t for t in g.triples((dataset_ref, DCAT.contactPoint, None))] + + assert len(contact_details) == len(dataset["contact"]) + assert self._triple( + g, contact_details[0][2], VCARD.fn, dataset_dict["contact"][0]["name"] + ) + assert self._triple( + g, + contact_details[0][2], + VCARD.hasEmail, + URIRef("mailto:" + dataset_dict["contact"][0]["email"]), + ) + assert self._triple( + g, contact_details[1][2], VCARD.fn, dataset_dict["contact"][1]["name"] + ) + assert self._triple( + g, + contact_details[1][2], + VCARD.hasEmail, + URIRef("mailto:" + dataset_dict["contact"][1]["email"]), + ) + + publisher = [t for t in g.triples((dataset_ref, DCT.publisher, None))] + + assert len(publisher) == 1 + assert self._triple( + g, publisher[0][2], FOAF.name, dataset_dict["publisher"][0]["name"] + ) + assert self._triple( + g, + publisher[0][2], + VCARD.hasEmail, + URIRef("mailto:" + dataset_dict["publisher"][0]["email"]), + ) + assert self._triple( + g, + publisher[0][2], + FOAF.homepage, + 
URIRef(dataset_dict["publisher"][0]["url"]), + ) + assert self._triple( + g, + publisher[0][2], + DCT.type, + dataset_dict["publisher"][0]["type"], + ) + + temporal = [t for t in g.triples((dataset_ref, DCT.temporal, None))] + + assert len(temporal) == len(dataset["temporal_coverage"]) + assert self._triple( + g, + temporal[0][2], + SCHEMA.startDate, + dataset_dict["temporal_coverage"][0]["start"], + data_type=XSD.date, + ) + assert self._triple( + g, + temporal[0][2], + SCHEMA.endDate, + dataset_dict["temporal_coverage"][0]["end"], + data_type=XSD.date, + ) + assert self._triple( + g, + temporal[1][2], + SCHEMA.startDate, + dataset_dict["temporal_coverage"][1]["start"], + data_type=XSD.date, + ) + assert self._triple( + g, + temporal[1][2], + SCHEMA.endDate, + dataset_dict["temporal_coverage"][1]["end"], + data_type=XSD.date, + ) + + spatial = [t for t in g.triples((dataset_ref, DCT.spatial, None))] + assert len(spatial) == len(dataset["spatial_coverage"]) + assert str(spatial[0][2]) == dataset["spatial_coverage"][0]["uri"] + assert self._triple(g, spatial[0][2], RDF.type, DCT.Location) + assert self._triple( + g, spatial[0][2], SKOS.prefLabel, dataset["spatial_coverage"][0]["text"] + ) + + assert len([t for t in g.triples((spatial[0][2], LOCN.geometry, None))]) == 2 + # Geometry in GeoJSON + assert self._triple( + g, + spatial[0][2], + LOCN.geometry, + dataset["spatial_coverage"][0]["geom"], + GEOJSON_IMT, + ) + # Geometry in WKT + wkt_geom = wkt.dumps(dataset["spatial_coverage"][0]["geom"], decimals=4) + assert self._triple(g, spatial[0][2], LOCN.geometry, wkt_geom, GSP.wktLiteral) + + distribution_ref = self._triple(g, dataset_ref, DCAT.distribution, None)[2] + resource = dataset_dict["resources"][0] + + # Resources: core fields + + assert self._triple(g, distribution_ref, DCT.title, resource["name"]) + assert self._triple( + g, + distribution_ref, + DCT.description, + resource["description"], + ) + + # Resources: standard fields + + assert self._triple(g, distribution_ref, DCT.rights, resource["rights"]) + assert self._triple( + g, distribution_ref, ADMS.status, URIRef(resource["status"]) + ) + assert self._triple( + g, + distribution_ref, + DCAT.accessURL, + URIRef(resource["access_url"]), + ) + assert self._triple( + g, + distribution_ref, + DCATAP.availability, + URIRef(resource["availability"]), + ) + assert self._triple( + g, + distribution_ref, + DCAT.compressFormat, + URIRef(resource["compress_format"]), + ) + assert self._triple( + g, + distribution_ref, + DCAT.packageFormat, + URIRef(resource["package_format"]), + ) + assert self._triple( + g, + distribution_ref, + DCAT.downloadURL, + URIRef(resource["download_url"]), + ) + + assert self._triple( + g, distribution_ref, DCAT.byteSize, float(resource["size"]), XSD.decimal + ) + # Checksum + checksum = self._triple(g, distribution_ref, SPDX.checksum, None)[2] + assert checksum + assert self._triple(g, checksum, RDF.type, SPDX.Checksum) + assert self._triple( + g, + checksum, + SPDX.checksumValue, + resource["hash"], + data_type="http://www.w3.org/2001/XMLSchema#hexBinary", + ) + assert self._triple( + g, checksum, SPDX.algorithm, URIRef(resource["hash_algorithm"]) + ) + + # Resources: dates + assert self._triple( + g, + distribution_ref, + DCT.issued, + dataset["resources"][0]["issued"], + data_type=XSD.dateTime, + ) + assert self._triple( + g, + distribution_ref, + DCT.modified, + dataset["resources"][0]["modified"], + data_type=XSD.dateTime, + ) + + # Resources: list fields + assert ( + self._triples_list_values(g, 
distribution_ref, DCT.language) + == resource["language"] + ) + + # Resource: repeating subfields + access_services = [ + t for t in g.triples((distribution_ref, DCAT.accessService, None)) + ] + + assert len(access_services) == len(dataset["resources"][0]["access_services"]) + assert self._triple( + g, + access_services[0][2], + DCT.title, + resource["access_services"][0]["title"], + ) + + endpoint_urls = [ + str(t[2]) + for t in g.triples((access_services[0][2], DCAT.endpointURL, None)) + ] + assert endpoint_urls == resource["access_services"][0]["endpoint_url"] + + def test_publisher_fallback_org(self): + + org = factories.Organization( + title="Some publisher org", + ) + dataset_dict = { + "name": "test-dataset-2", + "title": "Test DCAT dataset 2", + "notes": "Lorem ipsum", + "owner_org": org["id"], + } + + dataset = call_action("package_create", **dataset_dict) + + s = RDFSerializer() + g = s.g + + dataset_ref = s.graph_from_dataset(dataset) + publisher = [t for t in g.triples((dataset_ref, DCT.publisher, None))] + + assert len(publisher) == 1 + assert self._triple(g, publisher[0][2], FOAF.name, org["title"]) + + def test_publisher_fallback_org_ignored_if_publisher_field_present(self): + + org = factories.Organization() + dataset_dict = { + "name": "test-dataset-2", + "title": "Test DCAT dataset 2", + "notes": "Lorem ipsum", + "publisher": [ + { + "name": "Test Publisher", + "email": "publisher@example.org", + "url": "https://example.org", + "type": "public_body", + }, + ], + "owner_org": org["id"], + } + + dataset = call_action("package_create", **dataset_dict) + + s = RDFSerializer() + g = s.g + + dataset_ref = s.graph_from_dataset(dataset) + publisher = [t for t in g.triples((dataset_ref, DCT.publisher, None))] + + assert len(publisher) == 1 + assert self._triple( + g, publisher[0][2], FOAF.name, dataset_dict["publisher"][0]["name"] + ) + + def test_empty_repeating_subfields_not_serialized(self): + + dataset_dict = { + "name": "test-dataset-3", + "title": "Test DCAT dataset 3", + "notes": "Lorem ipsum", + "spatial_coverage": [ + { + "uri": "", + "geom": "", + }, + ], + } + + dataset = call_action("package_create", **dataset_dict) + + s = RDFSerializer() + g = s.g + + dataset_ref = s.graph_from_dataset(dataset) + assert not [t for t in g.triples((dataset_ref, DCT.spatial, None))] + + def test_legacy_fields(self): + + dataset_dict = { + "name": "test-dataset-2", + "title": "Test DCAT dataset 2", + "notes": "Lorem ipsum", + "extras": [ + {"key": "contact_name", "value": "Test Contact"}, + {"key": "contact_email", "value": "contact@example.org"}, + {"key": "publisher_name", "value": "Test Publisher"}, + {"key": "publisher_email", "value": "publisher@example.org"}, + {"key": "publisher_url", "value": "https://example.org"}, + {"key": "publisher_type", "value": "public_body"}, + ], + } + + dataset = call_action("package_create", **dataset_dict) + + s = RDFSerializer() + g = s.g + + dataset_ref = s.graph_from_dataset(dataset) + contact_details = [t for t in g.triples((dataset_ref, DCAT.contactPoint, None))] + assert len(contact_details) == 1 + assert self._triple(g, contact_details[0][2], VCARD.fn, "Test Contact") + + publisher = [t for t in g.triples((dataset_ref, DCT.publisher, None))] + assert len(publisher) == 1 + assert self._triple(g, publisher[0][2], FOAF.name, "Test Publisher") + + def test_dcat_date(self): + dataset_dict = { + # Core fields + "name": "test-dataset", + "title": "Test DCAT dataset", + "notes": "Some notes", + "issued": "2024", + "modified": "2024-10", + 
"temporal_coverage": [ + {"start": "1905-03-01T10:07:31.182680", "end": "2013-01-05"}, + {"start": "2024-04-10T10:07:31", "end": "2024-05-29"}, + {"start": "11/24/24", "end": "06/12/12"}, + ], + } + + dataset = call_action("package_create", **dataset_dict) + + s = RDFSerializer() + g = s.g + + dataset_ref = s.graph_from_dataset(dataset) + + # Year + assert dataset["issued"] == dataset_dict["issued"] + assert self._triple( + g, + dataset_ref, + DCT.issued, + dataset_dict["issued"], + data_type=XSD.gYear, + ) + + # Year-month + assert dataset["modified"] == dataset_dict["modified"] + assert self._triple( + g, + dataset_ref, + DCT.modified, + dataset_dict["modified"], + data_type=XSD.gYearMonth, + ) + + temporal = [t for t in g.triples((dataset_ref, DCT.temporal, None))] + + # Date + assert ( + dataset["temporal_coverage"][0]["end"] + == dataset_dict["temporal_coverage"][0]["end"] + ) + + assert self._triple( + g, + temporal[0][2], + SCHEMA.endDate, + dataset_dict["temporal_coverage"][0]["end"], + data_type=XSD.date, + ) + + # Datetime + assert ( + dataset["temporal_coverage"][0]["start"] + == dataset_dict["temporal_coverage"][0]["start"] + ) + assert self._triple( + g, + temporal[0][2], + SCHEMA.startDate, + dataset_dict["temporal_coverage"][0]["start"], + data_type=XSD.dateTime, + ) + + assert ( + dataset["temporal_coverage"][1]["start"] + == dataset_dict["temporal_coverage"][1]["start"] + ) + assert self._triple( + g, + temporal[1][2], + SCHEMA.startDate, + dataset_dict["temporal_coverage"][1]["start"], + data_type=XSD.dateTime, + ) + + # Ambiguous Datetime + assert ( + dataset["temporal_coverage"][2]["start"] + == dataset_dict["temporal_coverage"][2]["start"] + ) + assert self._triple( + g, + temporal[2][2], + SCHEMA.startDate, + "2024-11-24T00:00:00", + data_type=XSD.dateTime, + ) + assert ( + dataset["temporal_coverage"][2]["end"] + == dataset_dict["temporal_coverage"][2]["end"] + ) + assert self._triple( + g, + temporal[2][2], + SCHEMA.endDate, + "2012-06-12T00:00:00", + data_type=XSD.dateTime, + ) + + +@pytest.mark.usefixtures("with_plugins", "clean_db") +@pytest.mark.ckan_config("ckan.plugins", "dcat scheming_datasets") +@pytest.mark.ckan_config( + "scheming.dataset_schemas", "ckanext.dcat.schemas:dcat_ap_2.1_full.yaml" +) +@pytest.mark.ckan_config( + "scheming.presets", + "ckanext.scheming:presets.json ckanext.dcat.schemas:presets.yaml", +) +@pytest.mark.ckan_config( + "ckanext.dcat.rdf.profiles", "euro_dcat_ap_2 euro_dcat_ap_scheming" +) +class TestSchemingValidators: + def test_mimetype_is_guessed(self): + dataset_dict = { + "name": "test-dataset-2", + "title": "Test DCAT dataset 2", + "notes": "Lorem ipsum", + "resources": [ + {"url": "https://example.org/data.csv"}, + {"url": "https://example.org/report.pdf"}, + ], + } + + dataset = call_action("package_create", **dataset_dict) + + assert sorted([r["mimetype"] for r in dataset["resources"]]) == [ + "application/pdf", + "text/csv", + ] + + +@pytest.mark.usefixtures("with_plugins", "clean_db") +@pytest.mark.ckan_config("ckan.plugins", "dcat scheming_datasets") +@pytest.mark.ckan_config( + "scheming.dataset_schemas", "ckanext.dcat.schemas:dcat_ap_2.1_full.yaml" +) +@pytest.mark.ckan_config( + "scheming.presets", + "ckanext.scheming:presets.json ckanext.dcat.schemas:presets.yaml", +) +@pytest.mark.ckan_config( + "ckanext.dcat.rdf.profiles", "euro_dcat_ap_2 euro_dcat_ap_scheming" +) +class TestSchemingParseSupport(BaseParseTest): + def test_e2e_dcat_to_ckan(self): + """ + Parse a DCAT RDF graph into a CKAN dataset dict, create a 
dataset with package_create + and check that all expected fields are there + """ + contents = self._get_file_contents("dataset.rdf") + + p = RDFParser() + + p.parse(contents) + + datasets = [d for d in p.datasets()] + + assert len(datasets) == 1 + + dataset_dict = datasets[0] + + dataset_dict["name"] = "test-dcat-1" + dataset = call_action("package_create", **dataset_dict) + + # Core fields + + assert dataset["title"] == "Zimbabwe Regional Geochemical Survey." + assert ( + dataset["notes"] + == "During the period 1982-86 a team of geologists from the British Geological Survey ..." + ) + assert dataset["url"] == "http://dataset.info.org" + assert dataset["version"] == "2.3" + assert dataset["license_id"] == "cc-nc" + assert sorted([t["name"] for t in dataset["tags"]]) == [ + "exploration", + "geochemistry", + "geology", + ] + + # Standard fields + assert dataset["version_notes"] == "New schema added" + assert dataset["identifier"] == u"9df8df51-63db-37a8-e044-0003ba9b0d98" + assert dataset["frequency"] == "http://purl.org/cld/freq/daily" + assert dataset["access_rights"] == "public" + assert dataset["provenance"] == "Some statement about provenance" + assert dataset["dcat_type"] == "test-type" + + assert dataset["issued"] == u"2012-05-10" + assert dataset["modified"] == u"2012-05-10T21:04:00" + + # List fields + assert sorted(dataset["conforms_to"]) == ["Standard 1", "Standard 2"] + assert sorted(dataset["language"]) == ["ca", "en", "es"] + assert sorted(dataset["theme"]) == [ + "Earth Sciences", + "http://eurovoc.europa.eu/100142", + "http://eurovoc.europa.eu/209065", + ] + assert sorted(dataset["alternate_identifier"]) == [ + "alternate-identifier-1", + "alternate-identifier-2", + ] + assert sorted(dataset["documentation"]) == [ + "http://dataset.info.org/doc1", + "http://dataset.info.org/doc2", + ] + assert sorted(dataset["temporal_resolution"]) == [ + "P1D", + "PT15M", + ] + assert sorted(dataset["spatial_resolution_in_meters"]) == [ + 1.5, + 2.0, + ] + assert sorted(dataset["is_referenced_by"]) == [ + "https://doi.org/10.1038/sdata.2018.22", + "test_isreferencedby", + ] + assert sorted(dataset["applicable_legislation"]) == [ + "http://data.europa.eu/eli/reg_impl/2023/138/oj", + "http://data.europa.eu/eli/reg_impl/2023/138/oj_alt", + ] + # Repeating subfields + + assert dataset["contact"][0]["name"] == "Point of Contact" + assert dataset["contact"][0]["email"] == "contact@some.org" + + assert ( + dataset["publisher"][0]["name"] == "Publishing Organization for dataset 1" + ) + assert dataset["publisher"][0]["email"] == "contact@some.org" + assert dataset["publisher"][0]["url"] == "http://some.org" + assert ( + dataset["publisher"][0]["type"] + == "http://purl.org/adms/publishertype/NonProfitOrganisation" + ) + assert dataset["temporal_coverage"][0]["start"] == "1905-03-01" + assert dataset["temporal_coverage"][0]["end"] == "2013-01-05" + + resource = dataset["resources"][0] + + # Resources: core fields + assert resource["url"] == "http://www.bgs.ac.uk/gbase/geochemcd/home.html" + + # Resources: standard fields + assert resource["license"] == "http://creativecommons.org/licenses/by-nc/2.0/" + assert resource["rights"] == "Some statement about rights" + assert resource["issued"] == "2012-05-11" + assert resource["modified"] == "2012-05-01T00:04:06" + assert resource["status"] == "http://purl.org/adms/status/Completed" + assert resource["size"] == 12323 + assert ( + resource["availability"] + == "http://publications.europa.eu/resource/authority/planned-availability/EXPERIMENTAL" + ) + 
assert ( + resource["compress_format"] + == "http://www.iana.org/assignments/media-types/application/gzip" + ) + assert ( + resource["package_format"] + == "http://publications.europa.eu/resource/authority/file-type/TAR" + ) + + assert resource["hash"] == "4304cf2e751e6053c90b1804c89c0ebb758f395a" + assert ( + resource["hash_algorithm"] + == "http://spdx.org/rdf/terms#checksumAlgorithm_sha1" + ) + + assert ( + resource["access_url"] == "http://www.bgs.ac.uk/gbase/geochemcd/home.html" + ) + assert "download_url" not in resource + + # Resources: list fields + assert sorted(resource["language"]) == ["ca", "en", "es"] + assert sorted(resource["documentation"]) == [ + "http://dataset.info.org/distribution1/doc1", + "http://dataset.info.org/distribution1/doc2", + ] + assert sorted(resource["conforms_to"]) == ["Standard 1", "Standard 2"] + + # Resources: repeating subfields + assert resource["access_services"][0]["title"] == "Sparql-end Point" + assert resource["access_services"][0]["endpoint_url"] == [ + "http://publications.europa.eu/webapi/rdf/sparql" + ] + + +@pytest.mark.usefixtures("with_plugins", "clean_db", "clean_index") +@pytest.mark.ckan_config("ckan.plugins", "dcat scheming_datasets") +@pytest.mark.ckan_config( + "scheming.dataset_schemas", "ckanext.dcat.schemas:dcat_ap_2.1_full.yaml" +) +@pytest.mark.ckan_config( + "scheming.presets", + "ckanext.scheming:presets.json ckanext.dcat.schemas:presets.yaml", +) +@pytest.mark.ckan_config( + "ckanext.dcat.rdf.profiles", "euro_dcat_ap_2 euro_dcat_ap_scheming" +) +class TestSchemingIndexFields: + def test_repeating_subfields_index(self): + + dataset_dict = { + # Core fields + "name": "test-dataset", + "title": "Test DCAT dataset", + "notes": "Some notes", + # Repeating subfields + "contact": [ + {"name": "Contact 1", "email": "contact1@example.org"}, + {"name": "Contact 2", "email": "contact2@example.org"}, + ], + } + + with mock.patch("ckan.lib.search.index.make_connection") as m: + call_action("package_create", **dataset_dict) + + # Dict sent to Solr + search_dict = m.mock_calls[1].kwargs["docs"][0] + assert search_dict["extras_contact__name"] == "Contact 1 Contact 2" + assert ( + search_dict["extras_contact__email"] + == "contact1@example.org contact2@example.org" + ) + + def test_repeating_subfields_search(self): + + dataset_dict = { + # Core fields + "name": "test-dataset", + "title": "Test DCAT dataset", + "notes": "Some notes", + # Repeating subfields + "contact": [ + {"name": "Contact 1", "email": "contact1@example.org"}, + {"name": "Contact 2", "email": "contact2@example.org"}, + ], + } + + dataset = call_action("package_create", **dataset_dict) + + result = call_action("package_search", q="Contact 2") + + assert result["results"][0]["id"] == dataset["id"] + + result = call_action("package_search", q="Contact 3") + + assert result["count"] == 0 + + def test_spatial_field(self): + + dataset_dict = { + # Core fields + "name": "test-dataset", + "title": "Test DCAT dataset", + "notes": "Some notes", + "spatial_coverage": [ + { + "uri": "https://sws.geonames.org/6361390/", + "centroid": {"type": "Point", "coordinates": [1.26639, 41.12386]}, + }, + { + "geom": { + "type": "Polygon", + "coordinates": [ + [ + [11.9936, 54.0486], + [11.9936, 54.2466], + [12.3045, 54.2466], + [12.3045, 54.0486], + [11.9936, 54.0486], + ] + ], + }, + "text": "Tarragona", + }, + ], + } + + with mock.patch("ckan.lib.search.index.make_connection") as m: + call_action("package_create", **dataset_dict) + + # Dict sent to Solr + search_dict = 
m.mock_calls[1].kwargs["docs"][0] + assert search_dict["spatial"] == json.dumps( + dataset_dict["spatial_coverage"][0]["centroid"] + ) diff --git a/ckanext/dcat/tests/test_validators.py b/ckanext/dcat/tests/test_validators.py new file mode 100644 index 00000000..700cc644 --- /dev/null +++ b/ckanext/dcat/tests/test_validators.py @@ -0,0 +1,98 @@ +import datetime +import json + +import pytest + +from ckantoolkit import StopOnError, Invalid +from ckanext.dcat.validators import scheming_multiple_number, dcat_date + + +def test_scheming_multiple_number(): + + expected_value = [1.5, 2.0, 0.345] + + key = ("some_number_field",) + errors = {key: []} + + values = [ + expected_value, + [1.5, 2, 0.345], + ["1.5", "2", ".345"], + ] + for value in values: + data = {key: value} + scheming_multiple_number({}, {})(key, data, errors, {}) + + assert data[key] == json.dumps(expected_value) + + +def test_scheming_multiple_number_single_value(): + + expected_value = [1.5] + + key = ("some_number_field",) + errors = {key: []} + + values = [ + expected_value, + 1.5, + "1.5", + ] + for value in values: + data = {key: value} + scheming_multiple_number({}, {})(key, data, errors, {}) + + assert data[key] == json.dumps(expected_value) + + +def test_scheming_multiple_number_wrong_value(): + + key = ("some_number_field",) + errors = {key: []} + + values = [ + ["a", 2, 0.345], + ["1..5", "2", ".345"], + ] + for value in values: + with pytest.raises(StopOnError): + data = {key: value} + scheming_multiple_number({}, {})(key, data, errors, {}) + + assert errors[key][0].startswith("invalid type for repeating number") + + errors = {key: []} + + +def test_dcat_date_valid(): + + key = ("some_date",) + errors = {key: []} + valid_values = [ + datetime.datetime.now(), + "2024", + "2024-07", + "2024-07-01", + "1905-03-01T10:07:31.182680", + "2024-04-10T10:07:31", + "2024-04-10T10:07:31.000Z", + ] + + for value in valid_values: + data = {key: value} + dcat_date(key, data, errors, {}) + + +def test_dcat_date_invalid(): + + key = ("some_date",) + errors = {key: []} + invalid_values = [ + "2024+07", + "not_a_date", + ] + + for value in invalid_values: + data = {key: value} + with pytest.raises(Invalid): + dcat_date(key, data, errors, {}) diff --git a/ckanext/dcat/tests/utils.py b/ckanext/dcat/tests/utils.py index c62d9338..53618366 100644 --- a/ckanext/dcat/tests/utils.py +++ b/ckanext/dcat/tests/utils.py @@ -4,32 +4,32 @@ class BaseParseTest(object): - def _extras(self, dataset): extras = {} - for extra in dataset.get('extras'): - extras[extra['key']] = extra['value'] + for extra in dataset.get("extras"): + extras[extra["key"]] = extra["value"] return extras def _get_file_contents(self, file_name): - path = os.path.join(os.path.dirname(__file__), - '..', '..', '..', 'examples', - file_name) - with open(path, 'r') as f: + path = os.path.join( + os.path.dirname(__file__), "..", "..", "..", "examples", file_name + ) + with open(path, "r") as f: return f.read() class BaseSerializeTest(object): - def _extras(self, dataset): extras = {} - for extra in dataset.get('extras'): - extras[extra['key']] = extra['value'] + for extra in dataset.get("extras"): + extras[extra["key"]] = extra["value"] return extras def _triples(self, graph, subject, predicate, _object, data_type=None): - if not (isinstance(_object, URIRef) or isinstance(_object, BNode) or _object is None): + if not ( + isinstance(_object, URIRef) or isinstance(_object, BNode) or _object is None + ): if data_type: _object = Literal(_object, datatype=data_type) else: @@ 
-41,6 +41,15 @@ def _triple(self, graph, subject, predicate, _object, data_type=None): triples = self._triples(graph, subject, predicate, _object, data_type) return triples[0] if triples else None + def _triples_list_values(self, graph, subject, predicate): + return [str(t[2]) for t in graph.triples((subject, predicate, None))] + + def _triples_list_python_values(self, graph, subject, predicate): + return [ + t[2].value if isinstance(t[2], Literal) else str(t[2]) + for t in graph.triples((subject, predicate, None)) + ] + def _get_typed_list(self, list, datatype): """ returns the list with the given rdf type """ return [datatype(x) for x in list] @@ -48,6 +57,6 @@ def _get_typed_list(self, list, datatype): def _get_dict_from_list(self, dict_list, key, value): """ returns the dict with the given key-value """ for dict in dict_list: - if(dict.get(key) == value): + if dict.get(key) == value: return dict return None diff --git a/ckanext/dcat/validators.py b/ckanext/dcat/validators.py new file mode 100644 index 00000000..c9ee7d50 --- /dev/null +++ b/ckanext/dcat/validators.py @@ -0,0 +1,132 @@ +import datetime +import json +import re + +from dateutil.parser import parse as parse_date +from ckantoolkit import ( + missing, + StopOnError, + Invalid, + _, +) +from ckanext.scheming.validation import scheming_validator + +# https://www.w3.org/TR/xmlschema11-2/#gYear +regexp_xsd_year = re.compile( + r"-?([1-9][0-9]{3,}|0[0-9]{3})(Z|(\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" +) + +# https://www.w3.org/TR/xmlschema11-2/#gYearMonth +regexp_xsd_year_month = re.compile( + r"-?([1-9][0-9]{3,}|0[0-9]{3})-(0[1-9]|1[0-2])(Z|(\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" +) + +# https://www.w3.org/TR/xmlschema11-2/#date +regexp_xsd_date = re.compile( + r"-?([1-9][0-9]{3,}|0[0-9]{3})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])(Z|(\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" +) + + +def is_year(value): + return regexp_xsd_year.fullmatch(value) + + +def is_year_month(value): + return regexp_xsd_year_month.fullmatch(value) + + +def is_date(value): + return regexp_xsd_date.fullmatch(value) + + +def dcat_date(key, data, errors, context): + value = data[key] + + if isinstance(value, datetime.datetime): + return + + if is_year(value) or is_year_month(value) or is_date(value): + return + + try: + parse_date(value) + except ValueError: + raise Invalid( + _( + "Date format incorrect. Supported formats are YYYY, YYYY-MM, YYYY-MM-DD and YYYY-MM-DDTHH:MM:SS" + ) + ) + + return value + + +@scheming_validator +def scheming_multiple_number(field, schema): + """ + Accept repeating numbers input in the following forms and convert to a + json list of decimal values for storage. Also act like scheming_required + to check for at least one non-empty string when required is true: + + 1. a list of numbers, eg. + + [22, 1.3] + + 2. 
a single number value to allow single text fields to be + migrated to repeating numbers + + 33.4 + + """ + + def _scheming_multiple_number(key, data, errors, context): + # just in case there was an error before our validator, + # bail out here because our errors won't be useful + if errors[key]: + return + + value = data[key] + if value and value is not missing: + + if not isinstance(value, list): + if isinstance(value, str) and value.startswith("["): + try: + value = json.loads(value) + except ValueError: + errors[key].append(_("Could not parse value")) + raise StopOnError + else: + try: + value = [float(value)] + except ValueError: + errors[key].append(_("expecting list of numbers")) + raise StopOnError + + out = [] + for element in value: + if not element: + continue + try: + element = float(element) + except ValueError: + errors[key].append( + _("invalid type for repeating number: %r") % element + ) + continue + + out.append(element) + + if errors[key]: + raise StopOnError + + data[key] = json.dumps(out) + + if (data[key] is missing or data[key] == "[]") and field.get("required"): + errors[key].append(_("Missing value")) + raise StopOnError + + return _scheming_multiple_number + + +dcat_validators = { + "scheming_multiple_number": scheming_multiple_number, + "dcat_date": dcat_date, +} diff --git a/examples/dataset.rdf b/examples/dataset.rdf index fed71cc9..f7db02db 100644 --- a/examples/dataset.rdf +++ b/examples/dataset.rdf @@ -3,6 +3,7 @@ xmlns:time="http://www.w3.org/2006/time#" xmlns:dct="http://purl.org/dc/terms/" xmlns:dcat="http://www.w3.org/ns/dcat#" + xmlns:dcatap="http://data.europa.eu/r5r/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:adms="http://www.w3.org/ns/adms#" xmlns:schema="http://schema.org/" @@ -36,6 +37,8 @@ Standard 2 + 1.5 + 2.0 public @@ -49,6 +52,10 @@ + https://doi.org/10.1038/sdata.2018.22 + test_isreferencedby + + @@ -56,6 +63,8 @@ 2013-01-05 + PT15M + P1D Point of Contact @@ -79,9 +88,12 @@ Some statement about rights + http://publications.europa.eu/resource/authority/planned-availability/EXPERIMENTAL http://www.bgs.ac.uk/gbase/geochemcd/home.html HTML text/html + http://www.iana.org/assignments/media-types/application/gzip + http://publications.europa.eu/resource/authority/file-type/TAR 12323 @@ -96,7 +108,19 @@ - + + + + Sparql-end Point + + This SPARQL end point allow to directly query the EU Whoiswho content (organization / membership / person) + SPARQL url description + + + + + + diff --git a/examples/dataset_gob_es.ttl b/examples/dataset_gob_es.ttl new file mode 100644 index 00000000..70742f62 --- /dev/null +++ b/examples/dataset_gob_es.ttl @@ -0,0 +1,52 @@ +@prefix adms: . +@prefix dcat: . +@prefix dct: . +@prefix foaf: . +@prefix gsp: . +@prefix locn: . +@prefix owl: . +@prefix rdf: . +@prefix rdfs: . +@prefix schema: . +@prefix skos: . +@prefix time: . +@prefix vcard: . +@prefix xml: . +@prefix xsd: . + + a dcat:Dataset ; + dct:accrualPeriodicity ; + dct:description "Estudio de satisfacción de las personas directivas de los centros educativos con las escuelas municipales de promoción deportiva en horario extraescolar, organizadas desde la Dirección General de Deporte en colaboración con las diferentes federaciones deportivas de las diferentes modalidades deportivas, mediante el análisis de los resultados obtenidos al realizar una encuesta enviada al personal directivo de cada centro escolar a su correo electrónico invitándoles a participar. 
Las Escuelas Municipales de Promoción Deportiva en centros escolares tienen como finalidad fomentar una práctica deportiva estable y continuada entre los escolares de diferentes modalidades deportivas y la posterior iniciación a la competición con la participación de las escuelas en los Juegos Deportivos Municipales. El objeto de este estudio es conocer el grado de satisfacción del personal directivo con las Escuelas Municipales de Promoción Deportiva (EMPD) impartidas en los centros escolares de la ciudad de Madrid, mediante la realización de diversas preguntas relacionadas con los servicios ofrecidos (organización, instalaciones y profesorado). De esta manera se pueden detectar ámbitos o actuaciones que precisen intervenciones de mejora para alcanzar el objetivo general del Plan de Calidad establecido en el Ayuntamiento de Madrid: garantizar la calidad de los servicios prestados a la ciudadanía y su mejora continua, logrando la satisfacción ciudadana y alcanzando una gestión pública cada vez más eficaz y eficiente, participativa y transparente."@es ; + dct:identifier "https://datos.madrid.es/egob/catalogo/300676-0-deporte-encuesta-escuelas" ; + dct:issued "2024-04-15T09:08:02+02:00"^^xsd:dateTime ; + dct:language "es" ; + dct:license ; + dct:modified "2024-04-15T09:08:10+02:00"^^xsd:dateTime ; + dct:publisher ; + dct:title "Estudio de Satisfacción con el Programa de Escuelas Municipales de Promoción Deportiva en centros escolares"@es ; + dcat:distribution ; + dcat:keyword "deporte escolar"@es, + "escuelas"@es, + "promoción"@es ; + dcat:theme . + + a skos:Concept ; + skos:notation "L01280796" ; + skos:prefLabel "Ayuntamiento de Madrid" . + + a time:DurationDescription ; + time:years 1.0 . + + a dct:Frequency ; + rdf:value . + + a dcat:Distribution ; + dct:format ; + dct:title "Estudio de satisfacción temporada 2023-2024"@es ; + dcat:accessURL "https://datos.madrid.es/egob/catalogo/300676-0-deporte-encuesta-escuelas.xlsx" ; + dcat:byteSize 43008.0 . + + a dct:IMT ; + rdfs:label "XLS" ; + rdf:value "application/vnd.ms-excel" . + diff --git a/examples/dataset_gov_de.rdf b/examples/dataset_gov_de.rdf new file mode 100644 index 00000000..f2640e94 --- /dev/null +++ b/examples/dataset_gov_de.rdf @@ -0,0 +1,263 @@ + + + + Liefer- und Abholservices (Gastronomie) + Dieser Datensatz umfasst die Standorte der Liefer- und Abholservices (Gastronomie) in der Hanse- und Universitätsstadt Rostock mit Informationen zu Adresse, Art, Bezeichnung, Barrierefreiheit, Öffnungszeiten und Kontaktdaten. Die Ressourcen werden nur bei Bedarf aktualisiert. 
+ + c1be4007-d811-48fb-8818-2cb358b06a63 + + gastronomiebetriebe + handel + handel-und-verbrauch + lebensmittel + nahrung + nahrungsmittelgewerbe + wirtschaft + ökonomie + 2020-04-21T11:59:39.642528 + 2024-05-08T04:11:24.309486 + + + http://dcat-ap.de/def/dcatde/ + + + Hanse- und Universitätsstadt Rostock – Kataster-, Vermessungs- und Liegenschaftsamt + + + + + + Hanse- und Universitätsstadt Rostock + + + + + {"type":"Polygon","coordinates":[[[11.9936, 54.0486], [11.9936, 54.2466], [12.3045, 54.2466], [12.3045, 54.0486], [11.9936, 54.0486]]]} + POLYGON ((11.9936 54.0486, 11.9936 54.2466, 12.3045 54.2466, 12.3045 54.0486, 11.9936 54.0486)) + + + + + + + + + + Liefer- und Abholservices (Gastronomie) + + + + + + + http://dcat-ap.de/def/dcatde/ + + + 2020-04-21T11:59:53 + 2024-05-07T11:58:19 + 147253.0 + + + 1d5a3926327046c8fcd75652c6e44d3c74004c6727939f33fb05e16e324df241 + + + + + + + + + + Liefer- und Abholservices (Gastronomie) + + + + + + http://dcat-ap.de/def/dcatde/ + + 2020-04-21T11:59:53 + 2024-05-07T11:58:19 + + + + + + + Liefer- und Abholservices (Gastronomie) + + + + + + http://dcat-ap.de/def/dcatde/ + + 2020-04-21T11:59:53 + 2024-05-07T11:58:19 + + + + + + + Liefer- und Abholservices (Gastronomie) + Diese Daten umfassen die Standorte der Liefer- und Abholservices (Gastronomie) in der Hanse- und Universitätsstadt Rostock mit Informationen zu Adresse, Art, Bezeichnung, Öffnungszeiten und Kontaktdaten. + + + + + + http://dcat-ap.de/def/dcatde/ + 2023-02-08T12:25:41 + 2024-05-07T11:58:19 + + + + + + + Liefer- und Abholservices (Gastronomie) + Diese Karte umfasst die Standorte der Liefer- und Abholservices (Gastronomie) in der Hanse- und Universitätsstadt Rostock. + + + + + + http://dcat-ap.de/def/dcatde/ + 2023-02-08T12:25:41 + 2024-05-07T11:58:19 + + + + + + + Liefer- und Abholservices (Gastronomie) + + + + + + + http://dcat-ap.de/def/dcatde/ + + + 2020-04-21T11:59:53 + 2024-05-07T11:58:19 + 27256.0 + + + f2f17b94406047f148bfa41fb5945eb9f2023d6661a3806031baba519de7ba73 + + + + + + + + + + Liefer- und Abholservices (Gastronomie) + + + + + + + http://dcat-ap.de/def/dcatde/ + + + 2020-04-21T11:59:53 + 2024-05-07T11:58:19 + 60744.0 + + + 7cc6d258c20d5b5b507c8466900a381de8a349bb1a8118361c1a613e1b59fb14 + + + + + + + + + + Liefer- und Abholservices (Gastronomie) + + + + + + + http://dcat-ap.de/def/dcatde/ + + + 2020-04-21T11:59:53 + 2024-05-07T11:58:19 + 308724.0 + + + a407dd69e91861d98589ea9b14c39aa6972d56c2ce52d2dd3f0e0be8922fc6b7 + + + + + + + + + + Liefer- und Abholservices (Gastronomie) + + + + + + + http://dcat-ap.de/def/dcatde/ + + + 2020-04-21T11:59:53 + 2024-05-07T11:58:19 + 241844.0 + + + 788e8c33012ee44fbbfa1589277579fdc34211a6326962624789e182680307f2 + + + + + + + + + + + + + Rostock + + + Hanse- und Universitätsstadt Rostock – Kataster-, Vermessungs- und Liegenschaftsamt + geodienste@rostock.de + + + + + Hanse- und Universitätsstadt Rostock – Kataster-, Vermessungs- und Liegenschaftsamt + geodienste@rostock.de + + + + diff --git a/setup.py b/setup.py index fda14619..78fb19fa 100644 --- a/setup.py +++ b/setup.py @@ -43,6 +43,7 @@ [ckan.rdf.profiles] euro_dcat_ap=ckanext.dcat.profiles:EuropeanDCATAPProfile euro_dcat_ap_2=ckanext.dcat.profiles:EuropeanDCATAP2Profile + euro_dcat_ap_scheming=ckanext.dcat.profiles:EuropeanDCATAPSchemingProfile schemaorg=ckanext.dcat.profiles:SchemaOrgProfile [babel.extractors]
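A minimal usage sketch of the new scheming profile chain, based on the tests added in this PR. It assumes `dataset` is a dict returned by `package_create`/`package_show` against one of the scheming schemas above; the serializer calls are the same ones exercised in the test suite:

    from ckanext.dcat.processors import RDFSerializer

    # Chain the scheming compatibility profile after the DCAT-AP 2 profile,
    # matching the `ckanext.dcat.rdf.profiles` setting used in the tests
    s = RDFSerializer(profiles=["euro_dcat_ap_2", "euro_dcat_ap_scheming"])
    dataset_ref = s.graph_from_dataset(dataset)  # returns the dataset's URIRef
    print(s.g.serialize(format="turtle"))        # s.g is the rdflib graph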