- Running and getting help
- Creating components
- Creating codelists
- Creating cubes
- URIs
- Things to check
- Validation
table2qb
is a utility for specifying and generating elements of an RDF Data Cube. A data cube contains
a collection of homogeneous statistical observations along with a definition of their structure. Each observation is identified by a collection of
dimension values corresponding to one or more observed measure values along with an optional set of attributes which allow further interpretation
of the observed value(s). table2qb
exposes a number of 'pipelines' which generate the various elements that comprise a cube. A pipeline is a command
which takes a number of named arguments and outputs RDF data either to a file or to the standard output stream. This RDF data can then be inserted into
an RDF data store for further processing.
Once installed, table2qb
can be run via the cli
alias:
clojure -M:cli
Running without any further arguments outputs a brief help message which describes the tasks that table2qb
provides. A task is a sub-command
exposing some functionality with its own arguments. For example the help
task displays the help for a particular task e.g.
clojure -M:cli help list
describes how to use the list
task.
An observation consists of a number of dimensions, one or more measures and an optional set of attributes - collectively these are referred to
as components. Before being referenced by an observation, these components must be defined. Some components you wish to reference
(e.g. sdmx-dimension:refArea) may already be defined by existing vocabularies, but
you may have additional components specific to your organisation you wish to define. These can be created using the components-pipeline
.
Components are defined by the rows of a CSV file containing the following columns:
Label
: The name of the component. Note that various properties of the generated component are derived from this field (described below).Notation
: An optional string that may be used in the component's URI template. If not provided, this column is derived by slugging the Label.Description
: Textual description of the component.Component Type
: One ofDimension
,Measure
orAttribute
which specifies which component type is being defined.Codelist
: An optional URI for the corresponding code list. A code list enumerates the possible value a component can take within an observation and can be generated by the codelist-pipeline.
The employment example contains a component definitions file defining two components: a gender
dimension and a count measure. The components-pipeline
is run by providing the components file along with a base URI used to construct the URIs for the generated components
and their properties. This will usually be some sub-path of the linked data domain you will be hosting your cubes under. The components-pipeline
is run using the exec
task:
clojure -M:cli exec components-pipeline --input-csv path/to/components.csv --base-uri http://example.com/ --output-file components.ttl
Running with the employment example components results in two components being defined in the output components.ttl
file. Within components.ttl
the Gender
dimension is defined with the following properties:
<http://example.com/def/dimension/gender> rdfs:label "Gender" .
<http://example.com/def/dimension/gender> dcterms:description "The state of being male or female" .
<http://example.com/def/dimension/gender> a qb:DimensionProperty .
<http://example.com/def/dimension/gender> qb:codeList <http://statistics.gov.scot/def/concept-scheme/gender> .
<http://example.com/def/dimension/gender> skos:notation "gender" .
<http://example.com/def/dimension/gender> rdfs:range <http://example.com/def/Gender> .
<http://example.com/def/dimension/gender> rdfs:isDefinedBy <http://example.com/def/ontology/components> .
<http://example.com/def/dimension/gender> a rdf:Property .
There are a few things to note about the resulting component:
- The component URI is derived from the component
Label
and the providedbase-uri
value. The label is converted into a slug and then combined with the base URI and component type (dimension, measure or attribute) as{base-uri}/def/{component-type}/{slugged-label}
. For example, the gender dimension has a URI ofhttp://example.com/def/dimension/gender
, the count measurehttp://example.com/def/measure/count
, aTrade Currency
attribute would behttp://example.com/def/attribute/trade-currency
etc. - An
rdf:type
property is defined asqb:DimensionProperty
,qb:MeasureProperty
orqb:AttributeProperty
depending on whether theComponent Type
isDimension
,Measure
orAttribute
respectively. - The value of the
skos:notation
property is the slugized version of the label. - The value of the
rdfs:range
property is derived from the base URI and the classized version of the label. The resulting value URI is{base-uri}/def/{classized-label}
.
The Measure
component is defined similarly except its type is qb:MeasureProperty
and it has no associated qb:codeList
property.
The URI structure of some component properties will be made more configurable in a future version.
The values of certain dimension or attribute components within a data cube may be enumerated in a set of values. For such
components, a code list should be defined containing the possible values. In line with the recommendations within the
RDF data cube specification, table2qb
generates a skos:ConceptScheme
from a CSV
file defining the values and (where present) the hierarchical structure of the codelist. The definition CSV file should contain the following columns:
Label
: Label for the concept.Notation
: A unique value used to identify the code (such as a URN or alpha-numeric code). By default, the notation is used to generate the corresponding concept URI so it should only contain URI-compatible characters (you may also customise the URI pattern). If this column is missing, it will be derived by slugging the Label, e.g.3 Mineral Fuels
becomes3-mineral-fuels
.Parent Notation
: Codelists can be hierarchical where one entry represents a specialisation of another entry in the list. The parent notation should contain the notation of the broader concept in the list if one exists.Description
: Textual description of the concept.Sort Priority
: Optional numeric value indicating the position of the value within the code list. Some user interfaces may use this value to sort the code list values for display purposes.
The employment example contains a gender codelist. Note that the optional Description
and Sort Priority
columns
are missing. This file can be used to generate the codelist with codelist-pipeline
:
clojure -M:cli exec codelist-pipeline --codelist-csv path/to/gender-codelist.csv --codelist-name Gender --codelist-slug gender --base-uri http://example.com/ --output-file gender.ttl
This generates a skos:ConceptScheme
for gender containing members for the All
, Female
and Male
members within the codelist CSV file. The codelist is defined as:
<http://example.com/def/concept-scheme/gender> dcterms:title "Gender"@en ;
rdfs:label "Gender"@en ;
a skos:ConceptScheme ;
The URI of the codelist is constructed from the base-uri
and codelist-slug
parameters provided to the codelist-pipeline
invocation above. The constructed URI has the
form {base-uri}/def/concept-scheme/{codelist-slug}
. Note that when defining components using components-pipeline
, any associated codelist URI for the component must match
the one generated by codelist-pipeline
(or the URI of another Concept Scheme).
The members of the concept scheme have URIs of the form {base-uri}/def/concept-scheme/{codelist-slug}/{notation}
where notation
is the corresponding value of the Notation
column within the codelist CSV file. Member URIs have a prefix of the containing codelist URI. Along with some additional properties, the Female
member of the Gender
codelist is defined as:
<http://example.com/def/concept/gender/female> rdfs:label "Female" .
<http://example.com/def/concept/gender/female> skos:broader <http://example.com/def/concept/gender/all> .
<http://example.com/def/concept/gender/female> skos:inScheme <http://example.com/def/concept-scheme/gender> .
<http://example.com/def/concept-scheme/gender> skos:member <http://example.com/def/concept/gender/female> .
The member is connected to its containing scheme through the skos:inScheme
and skos:member
properties. Since the Female
member has a Parent Notation
of all
it has a
skos:broader
relationship with the corresponding all
member within the scheme.
After defining components and any associated codelists, data cubes can be created by the cube-pipeline
given a file containing observation data. The observation table should be arranged
in tidy data format i.e. one row per observation with one column per component (dimension, attribute or measure). An example observation file can
be seen in the employment example. Along with the observations data, the cube-pipeline
requires a configuration file which describes the meaning of each
column in the data and how to process the cells within. This configuration file should contain the following columns:
title
: Identifies a column heading within the observations data. This must be unique, each row of the configuration file (and hence each column of an observations file) must have a different title.name
: The variable name by which the column may be refered to within URI templates. It is recommended this value is the lower-cased version of thetitle
with spaces replaced with underscores e.g. the name of theMeasure Type
column would bemeasure_type
. Names should be unique within the configuration. Hyphens aren't permitted in URI templates.component_attachment
: Either blank, or one ofqb:dimension
,qb:measure
orqb:attribute
to indicate whether the column defines a dimension, attribute or measure of the observations. If blank, the column is assumed to contain observation values and will be attached to observations using the relevant measure property (see Measure dimension cubes for more details).property_template
: template for building the component property URIs used to link the corresponding value to the observation.value_template
: Optional URI template for component values.datatype
: Datatype of the values within the corresponding column.value_transformation
: Optional transformation to apply to cell values in the corresponding column. If specified it should be either slugize or unitize. More transformations may be offered in future.
An example column configuration file is defined in the employment example. The configuration must define all columns to be used within an observations file, but can also contain definitions
for additional columns that do not exist. This means that all known columns for multiple different cubes can be defined within a single definition file (subject to the constraints that the values within the title
and name
columns must be unique).
Observations within a cube are distinguished by the set of dimension values, but may have multiple associated measures. The data cube specification suggests two approaches to handling multiple measures. One is to associate a single measure value with each observation and to include an explicit "measure type" dimension which indicates which measure is being used. Such cubes are henceforth referred to as "measure dimension" cubes. The other approach - "multi-measure" cubes - associates a value for each measure to each observation.
A 'measure dimension' cube is one where each observation has a single measure and a qb:measureType
dimension indicating which measure the observation corresponds to. If the cube contains multiple measures this means there
should be multiple observations for each combination of dimension values in the observations data (note table2qb
does not validate this requirement, see validation for validating generated cubes).
table2qb
requires the following constraints are met by the observations data:
- A single column exists with a
property_template
ofhttp://purl.org/linked-data/cube#measureType
. Note this is the literal value which must be used; the compact form ofqb:measureType
is not accepted. - Each cell in the measure type column must identify the associated measure by its
title
. The identified column must have acomponent_attachment
ofqb:measure
. - A single value column must exist. A
value
column is one with an emptycomponent_attachment
within the columns configuration. This should contain the value for the associated measure type.
The employment example observation file defines a measure dimension cube where:
- The
Measure Type
column has aproperty_template
ofhttp://purl.org/linked-data/cube#measureType
. - The values in the
Measure Type
column reference the correspondingqb:measure
column corresponding to the measure (by itstitle
property). There is only a single measure used in the cube (i.e. theCount
measure). - The
Value
column contains the measure value. The configuration for this column has an emptycomponent_attachment
.
A multi-measure cube is one where each observation has one or more associated measure properties and therefore no qb:measureType
property to indicate the measure type (since they are all associated with the observation).
table2qb
uses the absence of a Measure Type
column in the observations CSV to identify a cube as multi-measure. In that case, the columns configuration and input observations CSV must meet the following constraints:
- No columns in the observations CSV are configured with a
property_template
ofhttp://purl.org/linked-data/cube#measureType
in the columns configuration - At least one column in the observations data has a
component_attachment
ofqb:measure
- All of the columns in the observations have a non-empty
component_attachment
i.e. noValue
column is present.
The multi-measure example observations file defines a multi-measure cube where:
- There are no columns with a
property_template
ofhttp://purl.org/linked-data/cube#measureType
- this identifies the cube as multi-measure. - There are two columns, 'Count' and 'GBP Total' which have a
component_attachment
ofqb:measure
in the columns configuration. - There are no
Value
columns (Date, Geography and Flow areqb:dimension
columns)
In the resulting cube, each qb:Observation
will have two associated measure properties defined by the property_template
of the corresponding
column definitions e.g.
<http://example.com/data/test/2011/gb/import> a qb:Observation;
qb:dataSet <http://example.com/data/test>;
<http://gss-data.org.uk/def/measure/count> 1.0E3;
<http://gss-data.org.uk/def/measure/gbp-total> 2.0E4;
...
Given an observations file and a columns configuration file, the cube-pipeline
can be run:
clojure -M:cli exec cube-pipeline --input-csv path/to/input.csv --dataset-name Dataset --dataset-slug dataset --column-config path/to/column-configuration.csv --base-uri http://example.com/ --output-file cube.ttl
The URI of the generated cube will have the form {base-uri}/data/{dataset-slug}
where dataset-slug
is the value of the parameter provided cube-pipeline
. The cube will have a title matching the dataset-name
parameter.
A qb:Observation
is generated for each row in the observations CSV data. The observation corresponding to the first row of the observations within the employment example is:
<http://example.com/data/employment/S12000039/2017-Q1/female/count> a qb:Observation .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> qb:dataSet <http://example.com/data/employment> .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> sdmx-dimension:refArea <http://statistics.data.gov.uk/id/statistical-geography/S12000039> .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> sdmx-dimension:refPeriod <http://reference.data.gov.uk/id/quarter/2017-Q1> .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> <http://statistics.gov.scot/def/dimension/gender> <http://statistics.gov.scot/def/concept/gender/female> .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> qb:measureType <http://statistics.gov.scot/def/measure/count> .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> sdmx-attribute:unitMeasure <http://statistics.gov.scot/def/concept/measure-units/people> .
<http://example.com/data/employment/S12000039/2017-Q1/female/count> <http://statistics.gov.scot/def/measure/count> 2.07E4 .
The URI of the observation is derived from the base-uri
parameter and the dimension and measure values for the observation. The properties linking the observation to the corresponding component values are constructed from
the property_template
URI template in the column specification file. Similarly the corresponding values are either literals based on the declared data type, or URIs derived from the value_template
column. To illustrate,
the value of the Gender
column for this observation in the observation data is Female
. The column specification for the Gender
column defines a property_template
of http://statistics.gov.scot/def/dimension/gender
and
a value_template
of http://statistics.gov.scot/def/concept/gender/{gender}
. Note the gender
column is referenced by its name
(not title
) within the value URI template. The column also defines a value_transformation
of slugize
so observation values are converted into URI slugs before being incorporated into the value template. These are combined to produce the statement
<http://example.com/data/employment/S12000039/2017-Q1/female/count> <http://statistics.gov.scot/def/dimension/gender> <http://statistics.gov.scot/def/concept/gender/female> .
shown above. If a codelist has been generated by the codelist-pipeline
, care must be taken to ensure the value_template
for the associated dimension matches the format of the URI for generated members.
table2qb
pipelines output RDF data which frequently uses URIs to identify resources. table2qb
allows some customisation in the way URIs are generated through URI templates and various transformation on
the input data. These are described below.
table2qb
allows some URIs to be parameterised by input data, such as the property_template
and value_template
of the data cube columns configuration file.
The format of these templates are defined by RFC 6570 - URI Templates however the majority of templates will use relatively basic
features. The most common usage is to parameterise URIs by the values of various columns e.g.
http://example/def/concept/gender/{gender}
This references the gender
column, which should be defined within the columns configuration. Note referencing columns within URI templates is done by the column name
and not its title
(i.e. gender
instead of Gender
).
URIs are frequently used to identify linked data resources and table2qb
generates URIs in various places for this purpose. URIs place restrictions on the permissible characters within each component, and users
have additional expectations around the conventions used when building URIs. Since URIs components can be constructed from free-form text, table2qb
applies various transforms to the input data before incorporating
them into URI templates.
Input text is converted into a 'slugged' version using the slugize
transformation. This transformation is defined as:
- Convert the input string to lower-case
- Replace any non alphabetical characters with a
-
character - Replace sequences of
-
with a single-
character - Remove any trailing
-
For example "Gender" will be converted to gender
, "Export and Import Activity" to export-and-import-activity
.
The unitize
transformation is defined as:
- Replace
£
characters withGBP
- Follow the
slugize
transformation
For example the text £ 10000
is converted into gbp-10000
.
The classize
transformation is defined as:
- Capitalise the first letter of each word
- Remove whitespace around words
For example the text "date of birth" is converted into DateOfBirth
.
Note this transformation is only used internally for generating some URIs and is not a valid value for the value_transformation
in the data cube columns configuration.
The propertize
transformation is defined as:
- Lower-case the first letter of the first word
- Upper-case the first letter of all other words
- Remove the whitespace around words
For example the test "date of birth" is converted into dateOfBirth
.
Note this transformation is only used internally for generating some URIs and is not a valid value for the value_transformation
in the data cube columns configuration.
Table2qb defines default conventions for the structure of URIs for generated resources, for example the default structure of a component URI was shown above:
{base-uri}/def/{component-type}/{slugged-label}
Each pipeline defines their own set of named URI templates which it uses when defining resources such as components and codelists. The default
URI templates can be displayed using the uris
task, e.g.
clojure -M:cli uris components-pipeline
This displays the default structure of the URIs defined within the components-pipeline
. Each entry has a name and template containing two types of variable - template
variables and CSVW variables.
Template variables have the form $(variable-name)
. These are not defined by the URI templates specification and are expanded by each pipeline based on its required
input parameters. The result of this exapansion should be a valid URI or URI template. As an example, the components-pipeline
defines the following default template
for the URIs of generated components:
:component-uri $(base-uri)/def/{component_type_slug}/{notation}
base-uri
is a required parameter for the components-pipeline
so the resulting :component-uri
template will be the result of replacing $(base-uri)
with the
provided base-uri
parameter value when invoking the pipeline e.g.
clojure -M:cli exec components-pipeline --base-uri http://my-linked-domain ...
will result in :component-uri
being expanded to http://my-linked-domain/def/{component_type_slug}/{notation}
.
CSVW variables have the form {variable_name}
and reference columns within the intermediate CSVW table2qb
uses to generate RDF. These support the operators
defined within the URI templates specification.
Users can override the default URI templates by providing their own definitions in an EDN file containing a map of name to template. For component-uri
example above
this file could be defined as:
custom_uris.edn
{:component-uri "$(base-uri)/id/components/{notation}"}
No all URIs for a pipeline need to be defined within the file, any not specified will use the default value. The effective URI templates for the pipeline
can be output by providing this file to the uris
task e.g.
clojure -M:cli uris components-pipeline custom_uris.edn
This displays the URI templates that would be used by the pipeline on execution if the specified URIs file was provided.
If you are happy with the resulting URI structure, you can run the pipeline by specifying the optional --uris
parameter e.g.
clojure -M:cli exec components-pipeline --base-uri http://example.com --uris custom_uris.edn ...
The component, codelist and cube pipelines are run independently but resources created in one may be referenced in another. Below are some areas where care should be taken to ensure URIs generated by one pipeline match those referenced by another.
When calling the cube-pipeline, you identify component-properties with the property_template
of the columns-configuration. This template needs to match either an existing component property (such as sdmx-dimension:sex) or a component that you've created with the components-pipeline.
The components-pipeline will, by default, create components with URIs of the form {base-uri}/def/(dimension|measure|attribute)/{component-slug}
where component-slug
is the slugized version of the component label. You can change this behaviour by customising the URI configuration of the :component-uri
.
For example, if you were to use the components-pipeline to create a component with the Label
of "SITC Section"
you would need to ensure that any columns configured to use this component in the cube-pipeline had the property-template http://example.com/def/dimension/sitc-section
.
When calling the cube-pipeline, you identify dimension-values (typically codes) with the value_template
of dimensions in the columns-configuration. This template needs to match either an existing code (such as sdmx-code:sex-T) or a code that you've created with the codelist-pipeline.
The codelist-pipeline will, by default, create codes with URIs of the form {base-uri}/def/concept/{codelist-slug}/{notation}
where the codelist-slug
is provided as an argument when running the pipeline, and notation
is either provided as a column or (if missing) derived by slugizing the label
. This can be customised by changing the :code-uri
template.
You can match these URIs by configuring a value_template
like {base-uri}/def/concept/{codelist-slug}/{column_name}
, where column_name
refers to the value of the column's name
field (e.g. sex
). If you've not provided a notation
column to the codelist-pipeline, these values will need to be slugized. You can set the column's value_transformation
to slugize
to do this within table2qb.
For example, if you were to use the codelist-pipeline to create a code with a label
of "0 Food and Live Animals"
, you would need to ensure that any columns configured to use this code in the cube-pipeline had a name
with sitc_section
, a value-template
with http://example.com/def/concept/sitc-sections/{sitc_section}
and a value_transformation
with slugize
.
You can also specify the qb:codeList
using the Codelist
field in the component-pipeline
. This ought to correspond to an existing codelist (such as sdmx-code:sex or one you've created with the codelist-pipeline.
The codelist-pipeline will, by deafult, create codelists with URIs of the form {base-uri}/def/concept-scheme/{codelist-slug}
where codelist-slug
is provided as an argument when running the pipeline. This can be customised by changing the :codelist-uri
template.
For example, if you were to use the codelist-pipeline to create a codelist with the codelist-slug
of sitc-sections
you could call the components-pipeline
with a codelist
of http://example.com/def/concept-scheme/sitc-sections
.
You only need to worry about this if you're customising the used-codes codelist URI.
The cube-pipeline generates an skos:Collection
enumerating the codes that are used in each dimension. We create the skos:Collection
itself and then populate it with skos:member
triples in different parts of the pipeline.
The collection is created in the context of the csvw files that have one row per component. The template for the URI is configured with used-codes-codelist-uri-from-component
defaulting to $(base-uri)/data/$(dataset-slug)/codes-used/{component_slug}
.
The collection is populated in the context of the csvw files that have one row per observation. The template for the URI is configured with used-codes-codelist-uri-from-observation
defaulting to $(base-uri)/data/$(dataset-slug)/codes-used/{_name}
(note that _name
refers to the columns name).
If you customise one or other of these templates, you need to ensure that they match.
The pipelines used to define data cubes are run independently and table2qb
makes no attempt to validate that the various elements are defined and referenced consistently. For example, users
must ensure that the codelist URI for a component matches the one generated for the concept scheme created by the codelist-pipeline
. It is therefore recommended that data cubes are validated
for consistency after they have been generated. One such tool for validating RDF data is rdf-validator. This supports validating that a generated
cube conforms to the RDF data cube specification.