Using table2qb

table2qb is a utility for specifying and generating elements of an RDF Data Cube. A data cube contains a collection of homogeneous statistical observations along with a definition of their structure. Each observation is identified by a collection of dimension values corresponding to one or more observed measure values along with an optional set of attributes which allow further interpretation of the observed value(s). table2qb exposes a number of 'pipelines' which generate the various elements that comprise a cube. A pipeline is a command which takes a number of named arguments and outputs RDF data either to a file or to the standard output stream. This RDF data can then be inserted into an RDF data store for further processing.

Running and getting help

Once installed, table2qb can be run via the cli alias:

clojure -M:cli

Running without any further arguments outputs a brief help message describing the tasks that table2qb provides. A task is a sub-command that exposes some functionality and takes its own arguments. The help task displays the help for a particular task, e.g.

clojure -M:cli help list

describes how to use the list task.

Creating components

An observation consists of a number of dimensions, one or more measures and an optional set of attributes - collectively these are referred to as components. Before being referenced by an observation, these components must be defined. Some components you wish to reference (e.g. sdmx-dimension:refArea) may already be defined by existing vocabularies, but you may have additional components specific to your organisation you wish to define. These can be created using the components-pipeline. Components are defined by the rows of a CSV file containing the following columns:

  • Label: The name of the component. Note that various properties of the generated component are derived from this field (described below).
  • Notation: An optional string that may be used in the component's URI template. If not provided, this column is derived by slugging the Label.
  • Description: Textual description of the component.
  • Component Type: One of Dimension, Measure or Attribute which specifies which component type is being defined.
  • Codelist: An optional URI for the corresponding code list. A code list enumerates the possible values a component can take within an observation and can be generated by the codelist-pipeline.

The employment example contains a component definitions file defining two components: a gender dimension and a count measure. The components-pipeline is run by providing the components file along with a base URI used to construct the URIs for the generated components and their properties. This will usually be some sub-path of the linked data domain you will be hosting your cubes under. The components-pipeline is run using the exec task:

clojure -M:cli exec components-pipeline --input-csv path/to/components.csv --base-uri http://example.com/ --output-file components.ttl

Running with the employment example components results in two components being defined in the output components.ttl file. Within components.ttl the Gender dimension is defined with the following properties:

<http://example.com/def/dimension/gender> rdfs:label "Gender" .
<http://example.com/def/dimension/gender> dcterms:description "The state of being male or female" .
<http://example.com/def/dimension/gender> a qb:DimensionProperty .
<http://example.com/def/dimension/gender> qb:codeList <http://statistics.gov.scot/def/concept-scheme/gender> .
<http://example.com/def/dimension/gender> skos:notation "gender" .
<http://example.com/def/dimension/gender> rdfs:range <http://example.com/def/Gender> .
<http://example.com/def/dimension/gender> rdfs:isDefinedBy <http://example.com/def/ontology/components> .
<http://example.com/def/dimension/gender> a rdf:Property .

There are a few things to note about the resulting component:

  • The component URI is derived from the component Label and the provided base-uri value. The label is converted into a slug and then combined with the base URI and component type (dimension, measure or attribute) as {base-uri}/def/{component-type}/{slugged-label}. For example, the gender dimension has a URI of http://example.com/def/dimension/gender, the count measure http://example.com/def/measure/count, a Trade Currency attribute would be http://example.com/def/attribute/trade-currency etc.
  • An rdf:type property is defined as qb:DimensionProperty, qb:MeasureProperty or qb:AttributeProperty depending on whether the Component Type is Dimension, Measure or Attribute respectively.
  • The value of the skos:notation property is the slugized version of the label.
  • The value of the rdfs:range property is derived from the base URI and the classized version of the label. The resulting value URI is {base-uri}/def/{classized-label}.

The Measure component is defined similarly except its type is qb:MeasureProperty and it has no associated qb:codeList property.
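The derivation rules above can be sketched in Python. This is a minimal illustration, not table2qb's implementation: the helper names (slugize, classize, component_uris) are hypothetical, the slug rule assumes runs of non-alphanumeric characters collapse to a single hyphen, and a trailing slash on the base URI is assumed to be dropped so that the result matches the example output.

```python
import re


def slugize(text: str) -> str:
    # Assumed slug rule: lower-case, collapse non-alphanumeric runs to "-",
    # then drop any trailing "-".
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).rstrip("-")


def classize(text: str) -> str:
    # Capitalise the first letter of each word and join the words together.
    return "".join(word[:1].upper() + word[1:] for word in text.split())


def component_uris(base_uri: str, component_type: str, label: str) -> dict:
    """Return the URIs a component with this label would receive by default."""
    base = base_uri.rstrip("/")  # assumption: avoid a double slash
    return {
        "uri": f"{base}/def/{component_type.lower()}/{slugize(label)}",
        "notation": slugize(label),
        "range": f"{base}/def/{classize(label)}",
    }


gender = component_uris("http://example.com/", "Dimension", "Gender")
print(gender["uri"])    # http://example.com/def/dimension/gender
print(gender["range"])  # http://example.com/def/Gender
```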

The URI structure of some component properties will be made more configurable in a future version.

Creating codelists

The values of certain dimension or attribute components within a data cube may be enumerated in a set of values. For such components, a code list should be defined containing the possible values. In line with the recommendations within the RDF data cube specification, table2qb generates a skos:ConceptScheme from a CSV file defining the values and (where present) the hierarchical structure of the codelist. The definition CSV file should contain the following columns:

  • Label: Label for the concept.
  • Notation: A unique value used to identify the code (such as a URN or alpha-numeric code). By default, the notation is used to generate the corresponding concept URI so it should only contain URI-compatible characters (you may also customise the URI pattern). If this column is missing, it will be derived by slugging the Label, e.g. 3 Mineral Fuels becomes 3-mineral-fuels.
  • Parent Notation: Codelists can be hierarchical where one entry represents a specialisation of another entry in the list. The parent notation should contain the notation of the broader concept in the list if one exists.
  • Description: Textual description of the concept.
  • Sort Priority: Optional numeric value indicating the position of the value within the code list. Some user interfaces may use this value to sort the code list values for display purposes.

The employment example contains a gender codelist. Note that the optional Description and Sort Priority columns are missing. This file can be used to generate the codelist with codelist-pipeline:

clojure -M:cli exec codelist-pipeline --codelist-csv path/to/gender-codelist.csv --codelist-name Gender --codelist-slug gender --base-uri http://example.com/ --output-file gender.ttl

This generates a skos:ConceptScheme for gender containing the All, Female and Male members defined in the codelist CSV file. The codelist is defined as:

<http://example.com/def/concept-scheme/gender> dcterms:title "Gender"@en ;
        rdfs:label "Gender"@en ;
        a skos:ConceptScheme ;

The URI of the codelist is constructed from the base-uri and codelist-slug parameters provided to the codelist-pipeline invocation above. The constructed URI has the form {base-uri}/def/concept-scheme/{codelist-slug}. Note that when defining components using components-pipeline, any associated codelist URI for the component must match the one generated by codelist-pipeline (or the URI of another Concept Scheme).

The members of the concept scheme have URIs of the form {base-uri}/def/concept/{codelist-slug}/{notation} where notation is the corresponding value of the Notation column within the codelist CSV file. Note that member URIs sit under /def/concept/ rather than under the /def/concept-scheme/ path of the containing codelist. Along with some additional properties, the Female member of the Gender codelist is defined as:

<http://example.com/def/concept/gender/female> rdfs:label "Female" .
<http://example.com/def/concept/gender/female> skos:broader <http://example.com/def/concept/gender/all> .
<http://example.com/def/concept/gender/female> skos:inScheme <http://example.com/def/concept-scheme/gender> .
<http://example.com/def/concept-scheme/gender> skos:member <http://example.com/def/concept/gender/female> .

The member is connected to its containing scheme through the skos:inScheme and skos:member properties. Since the Female member has a Parent Notation of all it has a skos:broader relationship with the corresponding all member within the scheme.
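As a rough sketch of how the URIs in this section fit together, the following Python derives the scheme URI, the member URIs and the skos:broader links from codelist rows. The function names and the row dictionaries are hypothetical (the rows only resemble the gender codelist); the URI patterns follow the example triples above, and the fallback slug rule for a missing Notation is an assumption.

```python
import re


def slugize(text: str) -> str:
    # Assumed slug rule used when a row has no Notation.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).rstrip("-")


def codelist_uris(base_uri: str, codelist_slug: str, rows: list) -> dict:
    """Return the scheme URI and a map of concept URI to broader concept URI."""
    base = base_uri.rstrip("/")  # assumption: avoid a double slash
    concept_prefix = f"{base}/def/concept/{codelist_slug}"
    broader = {}
    for row in rows:
        notation = row.get("Notation") or slugize(row["Label"])
        parent = row.get("Parent Notation")
        broader[f"{concept_prefix}/{notation}"] = (
            f"{concept_prefix}/{parent}" if parent else None
        )
    return {
        "scheme": f"{base}/def/concept-scheme/{codelist_slug}",
        "broader": broader,
    }


rows = [
    {"Label": "All", "Notation": "all", "Parent Notation": ""},
    {"Label": "Female", "Notation": "female", "Parent Notation": "all"},
    {"Label": "Male", "Parent Notation": "all"},  # Notation derived from Label
]
result = codelist_uris("http://example.com/", "gender", rows)
print(result["scheme"])
for concept, parent in result["broader"].items():
    print(concept, "->", parent)
```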

Creating cubes

After defining components and any associated codelists, data cubes can be created by the cube-pipeline given a file containing observation data. The observation table should be arranged in tidy data format i.e. one row per observation with one column per component (dimension, attribute or measure). An example observation file can be seen in the employment example. Along with the observations data, the cube-pipeline requires a configuration file which describes the meaning of each column in the data and how to process the cells within. This configuration file should contain the following columns:

  • title: Identifies a column heading within the observations data. This must be unique: each row of the configuration file (and hence each column of an observations file) must have a different title.
  • name: The variable name by which the column may be referred to within URI templates. It is recommended this value is the lower-cased version of the title with spaces replaced with underscores e.g. the name of the Measure Type column would be measure_type. Names should be unique within the configuration. Hyphens aren't permitted in URI templates.
  • component_attachment: Either blank, or one of qb:dimension, qb:measure or qb:attribute to indicate whether the column defines a dimension, attribute or measure of the observations. If blank, the column is assumed to contain observation values and will be attached to observations using the relevant measure property (see Measure dimension cubes for more details).
  • property_template: Template for building the component property URIs used to link the corresponding value to the observation.
  • value_template: Optional URI template for component values.
  • datatype: Datatype of the values within the corresponding column.
  • value_transformation: Optional transformation to apply to cell values in the corresponding column. If specified it should be either slugize or unitize. More transformations may be offered in future.

An example column configuration file is defined in the employment example. The configuration must define all columns to be used within an observations file, but can also contain definitions for additional columns that do not exist. This means that all known columns for multiple different cubes can be defined within a single definition file (subject to the constraints that the values within the title and name columns must be unique).
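The constraints on the configuration file listed above lend themselves to a quick pre-flight check. The sketch below is not part of table2qb: check_column_config and its messages are hypothetical, and the rows shown are an abridged, illustrative configuration rather than the contents of the employment example.

```python
from collections import Counter

ATTACHMENTS = {"", "qb:dimension", "qb:measure", "qb:attribute"}
TRANSFORMATIONS = {"", "slugize", "unitize"}


def check_column_config(rows: list) -> list:
    """Return a list of problems found in column configuration rows."""
    problems = []
    # Titles and names must each be unique across the configuration.
    for field in ("title", "name"):
        counts = Counter(row[field] for row in rows)
        problems += [
            f"duplicate {field}: {value}"
            for value, count in counts.items() if count > 1
        ]
    for row in rows:
        if "-" in row["name"]:
            problems.append(f"hyphen in name: {row['name']}")
        if row.get("component_attachment", "") not in ATTACHMENTS:
            problems.append(f"bad component_attachment for {row['title']}")
        if row.get("value_transformation", "") not in TRANSFORMATIONS:
            problems.append(f"bad value_transformation for {row['title']}")
    return problems


config = [
    {"title": "Gender", "name": "gender",
     "component_attachment": "qb:dimension", "value_transformation": "slugize"},
    {"title": "Measure Type", "name": "measure-type",  # deliberately wrong
     "component_attachment": "qb:dimension"},
    {"title": "Value", "name": "value", "component_attachment": ""},
]
print(check_column_config(config))  # ['hyphen in name: measure-type']
```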

Cube types

Observations within a cube are distinguished by the set of dimension values, but may have multiple associated measures. The data cube specification suggests two approaches to handling multiple measures. One is to associate a single measure value with each observation and to include an explicit "measure type" dimension which indicates which measure is being used. Such cubes are henceforth referred to as "measure dimension" cubes. The other approach - "multi-measure" cubes - associates a value for each measure to each observation.

Measure dimension cubes

A 'measure dimension' cube is one where each observation has a single measure and a qb:measureType dimension indicating which measure the observation corresponds to. If the cube contains multiple measures this means there should be multiple observations for each combination of dimension values in the observations data (note that table2qb does not validate this requirement; see Validation for validating generated cubes). table2qb requires that the observations data meet the following constraints:

  • A single column exists with a property_template of http://purl.org/linked-data/cube#measureType. Note this is the literal value which must be used; the compact form of qb:measureType is not accepted.
  • Each cell in the measure type column must identify the associated measure by its title. The identified column must have a component_attachment of qb:measure.
  • A single value column must exist. A value column is one with an empty component_attachment within the columns configuration. This should contain the value for the associated measure type.

The employment example observation file defines a measure dimension cube where:

  • The Measure Type column has a property_template of http://purl.org/linked-data/cube#measureType.
  • The values in the Measure Type column reference the qb:measure column for the measure (by its title property). There is only a single measure used in the cube (i.e. the Count measure).
  • The Value column contains the measure value. The configuration for this column has an empty component_attachment.

Multi-measure cubes

A multi-measure cube is one where each observation has one or more associated measure properties and therefore no qb:measureType property to indicate the measure type (since they are all associated with the observation). table2qb uses the absence of a Measure Type column in the observations CSV to identify a cube as multi-measure. In that case, the columns configuration and input observations CSV must meet the following constraints:

  • No columns in the observations CSV are configured with a property_template of http://purl.org/linked-data/cube#measureType in the columns configuration.
  • At least one column in the observations data has a component_attachment of qb:measure.
  • All of the columns in the observations have a non-empty component_attachment i.e. no Value column is present.

The multi-measure example observations file defines a multi-measure cube where:

  • There are no columns with a property_template of http://purl.org/linked-data/cube#measureType - this identifies the cube as multi-measure.
  • There are two columns, 'Count' and 'GBP Total' which have a component_attachment of qb:measure in the columns configuration.
  • There are no Value columns (Date, Geography and Flow are qb:dimension columns).

In the resulting cube, each qb:Observation will have two associated measure properties defined by the property_template of the corresponding column definitions e.g.

<http://example.com/data/test/2011/gb/import> a qb:Observation;
  qb:dataSet <http://example.com/data/test>;
  <http://gss-data.org.uk/def/measure/count> 1.0E3;
  <http://gss-data.org.uk/def/measure/gbp-total> 2.0E4;
  ...
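The two sets of constraints in this section can be summarised as a small decision procedure. The following Python is a sketch under stated assumptions, not table2qb's own logic: cube_type is a hypothetical helper that receives the configuration rows for the columns actually present in the observations CSV, and it checks only the column structure, not the cell values.

```python
MEASURE_TYPE = "http://purl.org/linked-data/cube#measureType"


def cube_type(columns: list) -> str:
    """Classify configured observation columns as one of the two cube types."""
    measure_type_cols = [c for c in columns
                         if c.get("property_template") == MEASURE_TYPE]
    value_cols = [c for c in columns if not c.get("component_attachment")]
    measure_cols = [c for c in columns
                    if c.get("component_attachment") == "qb:measure"]

    # One measure type column plus one value column: measure dimension cube.
    if len(measure_type_cols) == 1 and len(value_cols) == 1:
        return "measure-dimension"
    # No measure type column, no value column, at least one measure column.
    if not measure_type_cols and not value_cols and measure_cols:
        return "multi-measure"
    raise ValueError("columns match neither cube type")


multi = [
    {"title": "Date", "component_attachment": "qb:dimension"},
    {"title": "Geography", "component_attachment": "qb:dimension"},
    {"title": "Flow", "component_attachment": "qb:dimension"},
    {"title": "Count", "component_attachment": "qb:measure"},
    {"title": "GBP Total", "component_attachment": "qb:measure"},
]
single = [
    {"title": "Gender", "component_attachment": "qb:dimension"},
    {"title": "Measure Type", "component_attachment": "qb:dimension",
     "property_template": MEASURE_TYPE},
    {"title": "Value", "component_attachment": ""},
]
print(cube_type(multi))   # multi-measure
print(cube_type(single))  # measure-dimension
```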

Running the cube-pipeline

Given an observations file and a columns configuration file, the cube-pipeline can be run:

clojure -M:cli exec cube-pipeline --input-csv path/to/input.csv --dataset-name Dataset --dataset-slug dataset --column-config path/to/column-configuration.csv --base-uri http://example.com/ --output-file cube.ttl

The URI of the generated cube will have the form {base-uri}/data/{dataset-slug} where dataset-slug is the value of the parameter provided to cube-pipeline. The cube will have a title matching the dataset-name parameter. A qb:Observation is generated for each row in the observations CSV data. The observation corresponding to the first row of the observations within the employment example is:

<http://example.com/data/employment/S12000039/2017-Q1/female/count> a qb:Observation .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> qb:dataSet <http://example.com/data/employment> .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> sdmx-dimension:refArea <http://statistics.data.gov.uk/id/statistical-geography/S12000039> .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> sdmx-dimension:refPeriod <http://reference.data.gov.uk/id/quarter/2017-Q1> .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> <http://statistics.gov.scot/def/dimension/gender> <http://statistics.gov.scot/def/concept/gender/female> .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> qb:measureType <http://statistics.gov.scot/def/measure/count> .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> sdmx-attribute:unitMeasure <http://statistics.gov.scot/def/concept/measure-units/people> .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> <http://statistics.gov.scot/def/measure/count> 2.07E4 .

The URI of the observation is derived from the base-uri parameter and the dimension and measure values for the observation. The properties linking the observation to the corresponding component values are constructed from the property_template URI template in the column specification file. Similarly the corresponding values are either literals based on the declared data type, or URIs derived from the value_template column. To illustrate, the value of the Gender column for this observation in the observation data is Female. The column specification for the Gender column defines a property_template of http://statistics.gov.scot/def/dimension/gender and a value_template of http://statistics.gov.scot/def/concept/gender/{gender}. Note the gender column is referenced by its name (not title) within the value URI template. The column also defines a value_transformation of slugize so observation values are converted into URI slugs before being incorporated into the value template. These are combined to produce the statement

<http://example.com/data/employment/S12000039/2017-Q1/female/count> <http://statistics.gov.scot/def/dimension/gender> <http://statistics.gov.scot/def/concept/gender/female> .

shown above. If a codelist has been generated by the codelist-pipeline, care must be taken to ensure the value_template for the associated dimension matches the format of the URI for generated members.
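To make the combination concrete, here is a minimal Python sketch of how the Gender cell value could be turned into the object URI. It handles only simple {name} variables rather than the full URI Templates syntax, and both helper functions are hypothetical stand-ins with an assumed slug rule, not table2qb's code.

```python
import re
from urllib.parse import quote


def slugize(text: str) -> str:
    # Assumed slug rule: lower-case, non-alphanumeric runs become "-".
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).rstrip("-")


def expand(template: str, values: dict) -> str:
    """Expand simple {name} variables, percent-encoding each value."""
    return re.sub(r"\{(\w+)\}",
                  lambda m: quote(values[m.group(1)], safe=""),
                  template)


column = {
    "name": "gender",
    "property_template": "http://statistics.gov.scot/def/dimension/gender",
    "value_template": "http://statistics.gov.scot/def/concept/gender/{gender}",
}
cell = "Female"  # value of the Gender column in the first observation row

# The property template has no variables, so it expands to itself.
predicate = expand(column["property_template"], {})
# The cell is slugized first, then bound to the column's name.
obj = expand(column["value_template"], {column["name"]: slugize(cell)})
print(predicate)
print(obj)  # http://statistics.gov.scot/def/concept/gender/female
```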

URIs

table2qb pipelines output RDF data which frequently uses URIs to identify resources. table2qb allows some customisation in the way URIs are generated through URI templates and various transformations on the input data. These are described below.

URI Templates

table2qb allows some URIs to be parameterised by input data, such as the property_template and value_template of the data cube columns configuration file. The format of these templates is defined by RFC 6570 (URI Templates), although the majority of templates will use only relatively basic features. The most common usage is to parameterise URIs by the values of various columns e.g.

http://example/def/concept/gender/{gender}

This references the gender column, which should be defined within the columns configuration. Note referencing columns within URI templates is done by the column name and not its title (i.e. gender instead of Gender).
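A common mistake is to reference a column by its title rather than its name. The sketch below shows one way to catch this before running a pipeline. It recognises only simple {name} variables, and undefined_variables is a hypothetical helper rather than something table2qb provides.

```python
import re


def undefined_variables(template: str, column_names: set) -> list:
    """List template variables that do not match any configured column name."""
    referenced = re.findall(r"\{([^{}]+)\}", template)
    return [name for name in referenced if name not in column_names]


names = {"gender", "measure_type", "value"}  # name column of the configuration
good = undefined_variables("http://example/def/concept/gender/{gender}", names)
bad = undefined_variables("http://example/def/concept/gender/{Gender}", names)
print(good)  # []
print(bad)   # ['Gender']
```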

Transforms

URIs are frequently used to identify linked data resources and table2qb generates URIs in various places for this purpose. URIs place restrictions on the permissible characters within each component, and users have additional expectations around the conventions used when building URIs. Since URI components can be constructed from free-form text, table2qb applies various transforms to the input data before incorporating them into URI templates.

Slugize

Input text is converted into a 'slugged' version using the slugize transformation. This transformation is defined as:

  1. Convert the input string to lower-case
  2. Replace any non-alphanumeric characters with a - character
  3. Replace sequences of - with a single - character
  4. Remove any trailing -

For example "Gender" will be converted to gender, "Export and Import Activity" to export-and-import-activity.

Unitize

The unitize transformation is defined as:

  1. Replace £ characters with GBP
  2. Follow the slugize transformation

For example the text £ 10000 is converted into gbp-10000.
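The slugize and unitize transformations can be sketched in Python as follows. This follows the steps listed above rather than table2qb's source, and it assumes digits are preserved by slugize, as the 3-mineral-fuels and gbp-10000 examples in this document suggest.

```python
import re


def slugize(text: str) -> str:
    text = text.lower()                     # 1. lower-case
    text = re.sub(r"[^a-z0-9]", "-", text)  # 2. replace other characters
    text = re.sub(r"-+", "-", text)         # 3. collapse runs of "-"
    return text.rstrip("-")                 # 4. drop any trailing "-"


def unitize(text: str) -> str:
    # Swap the currency symbol for its code, then slugize the result.
    return slugize(text.replace("£", "GBP"))


print(slugize("Export and Import Activity"))  # export-and-import-activity
print(unitize("£ 10000"))                     # gbp-10000
```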

Classize

The classize transformation is defined as:

  1. Capitalise the first letter of each word
  2. Remove whitespace around words

For example the text "date of birth" is converted into DateOfBirth.

Note this transformation is only used internally for generating some URIs and is not a valid value for the value_transformation in the data cube columns configuration.

Propertize

The propertize transformation is defined as:

  1. Lower-case the first letter of the first word
  2. Upper-case the first letter of all other words
  3. Remove the whitespace around words

For example the text "date of birth" is converted into dateOfBirth.

Note this transformation is only used internally for generating some URIs and is not a valid value for the value_transformation in the data cube columns configuration.
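For illustration, the classize and propertize steps can be written in a few lines of Python. This is a sketch of the documented behaviour under the assumption that words are separated by whitespace; it is not table2qb's internal code.

```python
def classize(text: str) -> str:
    # Capitalise the first letter of each word, then join without whitespace.
    return "".join(word[:1].upper() + word[1:] for word in text.split())


def propertize(text: str) -> str:
    # As classize, but with the first letter of the result lower-cased.
    joined = classize(text)
    return joined[:1].lower() + joined[1:]


print(classize("date of birth"))    # DateOfBirth
print(propertize("date of birth"))  # dateOfBirth
```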

Customising URIs

table2qb defines default conventions for the structure of URIs for generated resources. For example, the default structure of a component URI was shown above:

{base-uri}/def/{component-type}/{slugged-label}

Each pipeline defines its own set of named URI templates which it uses when defining resources such as components and codelists. The default URI templates can be displayed using the uris task, e.g.

clojure -M:cli uris components-pipeline

This displays the default structure of the URIs defined within the components-pipeline. Each entry has a name and template containing two types of variable - template variables and CSVW variables.

Template variables

Template variables have the form $(variable-name). These are not defined by the URI templates specification and are expanded by each pipeline based on its required input parameters. The result of this expansion should be a valid URI or URI template. As an example, the components-pipeline defines the following default template for the URIs of generated components:

:component-uri       $(base-uri)/def/{component_type_slug}/{notation}

base-uri is a required parameter for the components-pipeline so the resulting :component-uri template will be the result of replacing $(base-uri) with the provided base-uri parameter value when invoking the pipeline e.g.

clojure -M:cli exec components-pipeline --base-uri http://my-linked-domain ...

will result in :component-uri being expanded to http://my-linked-domain/def/{component_type_slug}/{notation}.
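The substitution described here can be mimicked with a short Python function. It is only a sketch of the idea: expand_template_variables is a hypothetical name, and raising an error for a missing parameter is an assumption about sensible behaviour rather than a statement of what table2qb does.

```python
import re


def expand_template_variables(template: str, params: dict) -> str:
    """Replace each $(name) with its pipeline parameter, leaving {vars} alone."""
    def lookup(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing pipeline parameter: {name}")
        return params[name]
    return re.sub(r"\$\(([^)]+)\)", lookup, template)


default = "$(base-uri)/def/{component_type_slug}/{notation}"
expanded = expand_template_variables(
    default, {"base-uri": "http://my-linked-domain"})
print(expanded)
# http://my-linked-domain/def/{component_type_slug}/{notation}
```

The CSVW variables in braces are untouched at this stage, so the result is still a URI template.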

CSVW variables

CSVW variables have the form {variable_name} and reference columns within the intermediate CSVW table2qb uses to generate RDF. These support the operators defined within the URI templates specification.

Overriding defaults

Users can override the default URI templates by providing their own definitions in an EDN file containing a map of name to template. For the component-uri example above this file could be defined as:

custom_uris.edn

{:component-uri "$(base-uri)/id/components/{notation}"}

Not all URIs for a pipeline need to be defined within the file; any not specified will use the default value. The effective URI templates for the pipeline can be output by providing this file to the uris task e.g.

clojure -M:cli uris components-pipeline custom_uris.edn

This displays the URI templates that would be used by the pipeline on execution if the specified URIs file was provided.

If you are happy with the resulting URI structure, you can run the pipeline by specifying the optional --uris parameter e.g.

clojure -M:cli exec components-pipeline --base-uri http://example.com --uris custom_uris.edn ...
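The override behaviour described in this section amounts to a map merge in which user-supplied entries win. The Python below is a conceptual sketch only: effective_templates is a hypothetical name, and the example-other-uri entry is invented purely to show a default that is left untouched.

```python
def effective_templates(defaults: dict, overrides: dict) -> dict:
    """Overrides win; any template not overridden keeps its default."""
    return {**defaults, **overrides}


defaults = {
    "component-uri": "$(base-uri)/def/{component_type_slug}/{notation}",
    # Hypothetical second entry, present only to show an untouched default.
    "example-other-uri": "$(base-uri)/def/example",
}
overrides = {"component-uri": "$(base-uri)/id/components/{notation}"}

merged = effective_templates(defaults, overrides)
for name, template in merged.items():
    print(name, template)
```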

Things to check

The component, codelist and cube pipelines are run independently but resources created in one may be referenced in another. Below are some areas where care should be taken to ensure URIs generated by one pipeline match those referenced by another.

The property_template of columns in the cube-pipeline should be valid components

When calling the cube-pipeline, you identify component-properties with the property_template of the columns-configuration. This template needs to match either an existing component property (such as sdmx-dimension:sex) or a component that you've created with the components-pipeline.

The components-pipeline will, by default, create components with URIs of the form {base-uri}/def/(dimension|measure|attribute)/{component-slug} where component-slug is the slugized version of the component label. You can change this behaviour by customising the URI configuration of the :component-uri.

For example, if you were to use the components-pipeline to create a component with the Label of "SITC Section" you would need to ensure that any columns configured to use this component in the cube-pipeline had the property_template http://example.com/def/dimension/sitc-section.

The value_template of dimension columns in the cube-pipeline should be valid codes

When calling the cube-pipeline, you identify dimension-values (typically codes) with the value_template of dimensions in the columns-configuration. This template needs to match either an existing code (such as sdmx-code:sex-T) or a code that you've created with the codelist-pipeline.

The codelist-pipeline will, by default, create codes with URIs of the form {base-uri}/def/concept/{codelist-slug}/{notation} where the codelist-slug is provided as an argument when running the pipeline, and notation is either provided as a column or (if missing) derived by slugizing the label. This can be customised by changing the :code-uri template.

You can match these URIs by configuring a value_template like {base-uri}/def/concept/{codelist-slug}/{column_name}, where column_name refers to the value of the column's name field (e.g. sex). If you've not provided a notation column to the codelist-pipeline, these values will need to be slugized. You can set the column's value_transformation to slugize to do this within table2qb.

For example, if you were to use the codelist-pipeline to create a code with a label of "0 Food and Live Animals", you would need to ensure that any columns configured to use this code in the cube-pipeline had a name of sitc_section, a value_template of http://example.com/def/concept/sitc-sections/{sitc_section} and a value_transformation of slugize.
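One way to convince yourself that the two sides line up is to compute both URIs and compare them. The sketch below does this for the SITC example. The helper names are hypothetical, the slug rule is an assumption, and only a single simple {name} variable is substituted.

```python
import re


def slugize(text: str) -> str:
    # Assumed slug rule: lower-case, non-alphanumeric runs become "-".
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).rstrip("-")


def code_uri(base_uri: str, codelist_slug: str, label: str) -> str:
    # Default code URI when the codelist CSV has no Notation column.
    base = base_uri.rstrip("/")
    return f"{base}/def/concept/{codelist_slug}/{slugize(label)}"


def value_uri(value_template: str, name: str, cell: str) -> str:
    # Column configured with a value_transformation of slugize.
    return value_template.replace("{" + name + "}", slugize(cell))


label = "0 Food and Live Animals"
from_codelist = code_uri("http://example.com", "sitc-sections", label)
from_cube = value_uri(
    "http://example.com/def/concept/sitc-sections/{sitc_section}",
    "sitc_section", label)
print(from_cube)
print(from_codelist == from_cube)  # True
```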

A component's codelist should be valid

You can also specify the qb:codeList using the Codelist field in the components-pipeline. This ought to correspond to an existing codelist (such as sdmx-code:sex) or one you've created with the codelist-pipeline.

The codelist-pipeline will, by default, create codelists with URIs of the form {base-uri}/def/concept-scheme/{codelist-slug} where codelist-slug is provided as an argument when running the pipeline. This can be customised by changing the :codelist-uri template.

For example, if you were to use the codelist-pipeline to create a codelist with the codelist-slug of sitc-sections you could call the components-pipeline with a codelist of http://example.com/def/concept-scheme/sitc-sections.

Codes-used codelist URI needs to match

You only need to worry about this if you're customising the used-codes codelist URI.

The cube-pipeline generates a skos:Collection enumerating the codes that are used in each dimension. The skos:Collection itself is created, and then populated with skos:member triples, in different parts of the pipeline.

The collection is created in the context of the CSVW files that have one row per component. The template for the URI is configured with used-codes-codelist-uri-from-component, defaulting to $(base-uri)/data/$(dataset-slug)/codes-used/{component_slug}.

The collection is populated in the context of the CSVW files that have one row per observation. The template for the URI is configured with used-codes-codelist-uri-from-observation, defaulting to $(base-uri)/data/$(dataset-slug)/codes-used/{_name} (note that _name refers to the column's name).

If you customise one or other of these templates, you need to ensure that they match.

Validation

The pipelines used to define data cubes are run independently and table2qb makes no attempt to validate that the various elements are defined and referenced consistently. For example, users must ensure that the codelist URI for a component matches the one generated for the concept scheme created by the codelist-pipeline. It is therefore recommended that data cubes are validated for consistency after they have been generated. One such tool for validating RDF data is rdf-validator. This supports validating that a generated cube conforms to the RDF data cube specification.