| Document | Description |
|---|---|
| This README | An overview of the datastore, including a brief summary of the data model. |
| db-design-principles | A summary of the design principles, conventions and decisions made when designing and building the datastore. |
| db-schema-auth | `auth` schema definitions - tables, views and functions. |
| db-schema-reg | `reg` schema definitions - tables, views and functions. |
| db-schema-data | `data` schema definitions - tables, views and functions. |
| db-schema-def | `cm` schema definitions - tables, views and functions. |
| db-schema-voc | `voc` schema definitions - tables, views and functions. |
| db-schema-sys | `sys` schema definitions - tables, views and functions. |
The database is designed to store information about any environmental entity, and the sampling activities used to observe and describe it. The main goals are flexibility (allowing the definition of new classes or properties as required) and accuracy (classes and properties are fully described with metadata and uncertainty information).
The database comprises six main schemas (Figure 1): `auth`, `reg`, `cm`, `voc`, `data` and `sys`.
- `auth` holds the tables, functions and views supporting the PostgreSQL row level security (RLS) policies that manage access to records by client users.
- `reg` holds the system registry, organising and managing access to sets of data in the `cm`, `voc`, and `data` schemas.
- `cm` holds the type definitions for classes and properties stored in the `data` schema.
- `voc` holds the controlled vocabularies used in the `data` schema.
- `data` is an entity-attribute-value (see db-design-principles) database that holds the soil data managed and/or served by the EDR.
- `sys` is a schema for system tables and functions that are used by elements in all other schemas.
- `build` is a schema for useful functions that help automate the construction of the database - e.g. creating all necessary `periods` extension system versioning tables.
Figure 1. Overview of EDR schema and their base tables.
There is a `working` schema for temporary tables and views - e.g. to support data maintenance and loading.
A brief summary of the database is presented in this document. For clarity, full definitions for each schema are maintained in separate documents. Across all markdown documents the definition table structure is largely self-explanatory, but some unusual conventions for the content are as follows:
- The Key column refers to whether a column is part of a primary key (PK) or foreign key (FK). So-called weak foreign keys are shown with a lower case fk - these should have records in the target table, but may not (probably for no good reason), so referential integrity is not enforced (see the example table after this list).
- If a foreign key spans two columns the FK value will be given a lower case letter suffix in brackets (e.g. FK(a)) and be reused by all participating columns.
- The Null column shows if a column or function parameter is required (N for not null) or optional (Y for null).
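For example, a (purely illustrative) definition table using these conventions might look like:

| Column | Key | Null | Data type | Description |
|---|---|---|---|---|
| id | PK | N | uuid | Primary key. |
| class_model_id | FK(a) | N | uuid | Part of a two-column foreign key to the class model. |
| class_id | FK(a) | N | uuid | Second column of the same two-column foreign key. |
| concept_id | fk | Y | uuid | Weak foreign key to a concept; referential integrity not enforced. |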
See db-design-principles for an overview of database conventions, and explanations of why and how certain things were implemented as they were.
Data types and functions for the manipulation and querying of spatial data are implemented using the PostGIS extension (https://postgis.net/). PostGIS functions and registers are stored in the `public` schema. Functions that support topological operations are stored in the `topology` schema.
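As an illustration only, the extensions can be installed and checked along these lines (a minimal sketch; schema placement follows the PostGIS defaults described above):

```sql
-- Install PostGIS (objects land in public) and its topology support
-- (objects land in topology).
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS postgis_topology;

-- Quick sanity check that the spatial functions are available.
SELECT postgis_full_version();
SELECT ST_AsText(ST_SetSRID(ST_MakePoint(175.28, -37.79), 4326));
```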
The `sys` and `build` schemas hold tables, views and functions used throughout the database. These support essential database operations relating to the generation of resource identifiers, and the generation of DDL to create objects that manage transaction history and user access. They also include a set of helper functions and views to automate the bulk creation of certain database objects (a sketch follows the list below), such as:
- system columns
- system versioning tables
- row level security policies
- views for API middleware (experimental)
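By way of illustration only - the helper names and signatures below are hypothetical, not the actual `build` API - a bulk-creation helper might be invoked like this:

```sql
-- Hypothetical helpers (names and signatures are illustrative only):
-- add the standard system columns and system versioning to a table,
-- then generate and apply its row level security policies.
SELECT build.add_system_columns('data.entity');
SELECT build.add_system_versioning('data.entity');
SELECT build.create_rls_policies('data.entity');
```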
The `auth` schema holds tables, types, views and functions that support the implementation of row level security (RLS) in the database. Each table in the database will allow select, update, insert and delete operations to users that also provide a legitimate JSON Web Token (JWT) that includes an operator ID that can be checked against the access level specified in the relevant authorisation table. This means a client can access the database using a client-specific login role that has been given membership of the `edr_jwt` group role, and individual users can be distinguished by the operator specified in the JWT claims. This assumes all non-administrative access to the database is via APIs that authenticate users with the MWLR Authentication Service.
- Operators are registered in `auth.operator`.
- Changes to entities, and their properties and associations, are authorised according to access rights granted on their datasets.
- Changes to datasets and registers, and their properties and associations, are authorised according to access rights granted on their registers.
- An anonymous operator allowing read access to open datasets has been placed in `auth.operator`. This user may not be granted admin or edit access.
- Rights granted to a parent or ancestor register/dataset are inherited by descendants. Inherited rights may be overridden by denying a user access - this denial then propagates to all descendants and cannot itself be overridden.
- For each dataset (register or dataset), the various rights are summarised into arrays of operators that enforce inherited and denied access in the view `auth.dataset__authorisation`.
- Data administrators and maintainers can be granted membership of group roles that support admin (`edr_admin`), edit (`edr_edit`), and read (`edr_read`) tasks. These roles bypass RLS policies.
With time, access constraints will extend to the suppression of 'restricted attributes'.
Due to their importance to the management of data, access rights are defined using a PostgreSQL enumeration type: `auth_access` (admin, edit, read, deny).
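A minimal sketch of the type (its placement in the `auth` schema is assumed here):

```sql
-- Access rights enumeration (schema placement assumed).
CREATE TYPE auth.auth_access AS ENUM ('admin', 'edit', 'read', 'deny');
```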
Edit access is only applicable to:
- entities, entity attributes and entity associations
- class, attribute, and association definitions
- concepts and their properties
All users should have read access to datasets. Otherwise, as datasets are inherently administrative structures, edit access on these is restricted to operators with admin authorisation.
By default, edit and read access to entities is restricted.
Anything can be subject to read access.
RLS policies use the function `auth.operator_access` to check the operator (by reading the JWT) against the requested access role (as implied by the SQL operation) and dataset.
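A minimal sketch of how such a policy might be declared - the `auth.operator_access` signature, the `dataset_id` column and the `edr_jwt` role usage shown here are assumptions, not the actual definitions:

```sql
-- Hypothetical RLS policy: allow updates on entities only when the
-- operator identified by the JWT has at least 'edit' rights on the dataset.
ALTER TABLE data.entity ENABLE ROW LEVEL SECURITY;

CREATE POLICY entity_edit ON data.entity
    FOR UPDATE
    TO edr_jwt
    USING (auth.operator_access(dataset_id, 'edit'::auth.auth_access));
```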
The following table summarises the allowable operations on a table type* for a given authorisation**.
| Authorisation | `register` | `dataset` | `entity` |
|---|---|---|---|
| admin | S, I, U, D | S, I, U, D | S, I, U, D |
| edit | S | S | S, I, U, D |
| read | S | S | S |
| deny | revoke | revoke | revoke |
* The table type encompasses associated property tables (e.g. `entity` also includes `entity__assertion` and so on).
** S: select; I: insert; U: update; D: delete
Figure 2. Summary of the `auth` schema authorisation tables.
Datasets (class models, concept schemes and collections, and environmental datasets) are managed by registry tables (Figure 3). The registry is made up of registers that define how the items in the register (class models, concept schemes and collections, and datasets) are governed and published. They define ownership, editing rights, and licensing constraints on publication and use. Registers also include metadata to support discovery and publication (e.g. providing DOIs for datasets). Datasets managed in `datasets` shall be organised for publication as `registers`.
Individual entities (and concepts, classes and so on) can be organised into registers for publication.
Figure 3. Summary of the `reg` schema registry tables.
The definition and vocabulary tables are the glue that holds the EAV structure together. They define:
- the class and property types created as (effectively) the key in the EAV key-value pairs in the `data` schema
- the concepts for controlled vocabularies that provide values in the key-value pairs in the `data` schema
Note that class models are managed in a separate database; this database will hold copies of class models as required.
Class models (Figure 4a) define the class and property (attribute and association) types for a given area of interest (e.g. soil). The domain of a property type is the class (or classes) it describes, while the range of a property is the type of its value. Literal types - e.g. strings and integers - are defined using system data types. Class and property types may be specialised into sub-types - for example the property type `relatedSamplingEntity` can be specialised into `relatedSoilSample` and `relatedSoilSite`.
All class models, class and property types are mapped to a set of strictly controlled system types (Figure 4b) by the `system__type_id` column referencing the relevant system type table. These tables include full definitions of each system type.

- `cm.system__type__class_model`: covers the classical modelling paradigms (conceptual, logical and physical).
- `cm.system__type__class`: defines a set of fundamental types taken from the RDF, PROV and SOSA ontologies.
- `cm.system__type__association` (property): fundamental association types.
- `cm.system__type__attribute` (property): fundamental attribute types.
- `cm.system__type__data_type`: fundamental data types for attribute values (derived from UML classes).
System type tables hold ids that are used in views, functions, triggers and so on that manage content in the `data` schema. As such, they can only be changed by the system admin user, and system type ids should only be changed as a last resort and where the system type hasn't been applied to a class or property type.
The system type of a class or property should not be changed once it has been used, as it can have a significant impact on how records are processed and stored. To protect from disruption, update triggers on `cm` tables check for the usage of a class or property, restricting change to an id if it is linked to a used class/property.
Possibly over-engineering things here - `on update restrict` may be more appropriate. And simpler.
The model for class models is based on the Resource Description Framework Schema (RDFS).
Figure 4. a) Summary of the `cm` schema class model tables. b) Class models, classes, properties and data types with their specialised system types shown as a conceptual class model. Definitions of each system type can be found in the system type table linked to the instantiating (implementing) `cm` schema table shown in the diagram.
For controlled vocabularies (Figure 5) concepts (terms) are organised into schemes. Concepts may be related to one another (e.g. as simple hierarchical taxonomies, or equivalence mappings between schemes) using semantic relationships. Concepts and schemes can be translated into multiple languages. Should concepts need to be aggregated for a particular purpose (say by combining two schemes, or sub-setting one, or both), a collection is created.
The model for vocabularies is based on the Simple Knowledge Organisation System (SKOS).
Figure 5. Summary of the `voc` schema vocabulary tables.
The `data` schema (Figure 6) holds the actual soil survey, sample and laboratory data.
Figure 6. a) Summary of the `data` schema tables. b) An example of how `data` tables link to `cm` tables.
In the `data` schema, classes are implemented in tables holding entities or datasets:

- `entities` hold instances of domain and sampling entities
- `datasets` organise `entities` into datasets based on direct ownership by a team or user for a particular purpose (e.g. a field survey). Dataset access is managed through the registry.
Class tables have a simple structure - records identify the existence of an instance of a class and declare its type (soil, sample, dataset etc.) using the `class_id` column referencing the table `cm.class`.
All properties of a class are held in related attribute and association tables (see below).
Class tables may include properties prefixed with `default_`. These allow a class to be located or labelled but are not data in their own right. They can only be created by copying or merging data from property tables.
Default values may be provided in related tables that extend the base table - for example, when not all types have a meaningful default value.
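A minimal sketch of what a class table might look like under this design (column names other than `class_id` are assumptions for illustration):

```sql
-- Hypothetical shape of a class table: a record simply asserts that an
-- instance exists and declares its type via cm.class.
CREATE TABLE data.entity (
    id            uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    class_id      uuid NOT NULL REFERENCES cm.class (id), -- the entity's class
    dataset_id    uuid NOT NULL,  -- assumed link to the owning dataset
    default_label text            -- denormalised label copied from property tables
);
```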
Attribute tables hold attribute values for a class as key-value pairs. They declare the type of attribute (e.g. moisture content) using `attribute_id` referencing the table `cm.attribute`. Rules for the data type, structure and acceptable values are held against the attribute definition. Values are held in a JSON object in the `value` column (examples follow the list below):
- the structure of the value is declared/validated by a JSON schema in the attribute range's `system__type__data_type.schema_json`
- when the value is a controlled concept, its value is a UUID referencing `cm.concept.id`
- when the value is a binary object (e.g. a spectra file), its value is a UUID referencing `data.binary_object`
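The following hypothetical snippets illustrate the three value shapes (the JSON key names are assumptions for illustration only):

```sql
-- 1. A literal value, validated against the data type's JSON schema.
SELECT '{"value": 23.4, "unit": "percent"}'::jsonb AS literal_value;

-- 2. A controlled concept: the value carries the concept's UUID.
SELECT '{"concept_id": "8d6e4f0a-1f7b-4c1e-9a2b-3c4d5e6f7a8b"}'::jsonb AS concept_value;

-- 3. A binary object (e.g. a spectra file): the value references data.binary_object.
SELECT '{"binary_object_id": "0b1c2d3e-4f5a-6b7c-8d9e-0f1a2b3c4d5e"}'::jsonb AS binary_value;
```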
Attributes may be asserted or observed:
- assertions are attributes whose values are assigned to an entity, for example a name, id or note
- observations are attributes whose values are estimated at some time using some procedure, and the result has a degree of uncertainty. Some observations are made using an agent, perhaps an electronic sensor (a measurement) or a computer algorithm (a prediction). The degree of uncertainty should be captured for measurements/predictions.
Most attribute values in this database are observations or measurements.
There are three base attribute tables based on these specific attribute system types (and their subtypes):
- `entity__assertion`: A simple attribute whose value is asserted
- `entity__observation`: A simple observation with a procedure and result time
- `entity__measurement`: An extended observation, including agent and uncertainty metadata
Trigger functions on these tables check that the attributes are of the right system type when records are inserted.
For convenience, all three of these tables can be accessed and updated through a single view: `entity__attribute`. During inserts, this view uses `instead of` triggers to send a record to the right base table according to the attribute's system type.
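A minimal sketch of the routing idea - the view's location, trigger names and column names here are assumptions, and the system type check is simplified to a label:

```sql
-- Hypothetical INSTEAD OF INSERT routing for the entity__attribute view.
CREATE OR REPLACE FUNCTION data.entity__attribute_insert() RETURNS trigger AS $$
BEGIN
    -- Route to a base table according to the attribute's system type.
    IF NEW.system_type = 'measurement' THEN
        INSERT INTO data.entity__measurement (entity_id, attribute_id, value)
        VALUES (NEW.entity_id, NEW.attribute_id, NEW.value);
    ELSIF NEW.system_type = 'observation' THEN
        INSERT INTO data.entity__observation (entity_id, attribute_id, value)
        VALUES (NEW.entity_id, NEW.attribute_id, NEW.value);
    ELSE
        INSERT INTO data.entity__assertion (entity_id, attribute_id, value)
        VALUES (NEW.entity_id, NEW.attribute_id, NEW.value);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER entity__attribute_insert
    INSTEAD OF INSERT ON data.entity__attribute
    FOR EACH ROW EXECUTE FUNCTION data.entity__attribute_insert();
```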
Association tables hold references to other entities (e.g. where a horizon is composed of constituent mottles). The type is a key referencing the table `cm.association`. Rules for the acceptable entity relationships are held against the association type definition.
As with class tables, associations have core types that are declared via their `system__type_id` column. The association types are:

- `aggregation` - an association between two classes where the component class cannot exist independently of the source class. On delete, a component must be deleted as well. As the aggregation table is a many-to-many table, the cascading delete will need to be managed through a trigger or `instead of` directive (see the sketch after this list).
- `relationship` - a simple association, or link, between classes. When an associated class is deleted, only the relationship to the other class is deleted.
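A sketch of how the aggregation clean-up might be handled (the trigger and column names are illustrative assumptions):

```sql
-- Hypothetical trigger: when an aggregation row is removed, also remove the
-- component entity, since it cannot exist independently of its source.
CREATE OR REPLACE FUNCTION data.entity__aggregation_delete() RETURNS trigger AS $$
BEGIN
    DELETE FROM data.entity WHERE id = OLD.component_entity_id;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER entity__aggregation_delete
    AFTER DELETE ON data.entity__aggregation
    FOR EACH ROW EXECUTE FUNCTION data.entity__aggregation_delete();
```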
There are two base association tables based on these specific association system types (and their subtypes):
- `entity__aggregation`: An aggregation of entities
- `entity__relationship`: A simple relationship
Trigger functions on these tables check that the relationships are of the right system type when records are inserted.
For convenience, both of these tables can be accessed and updated through a single view: `entity__association`. During inserts, this view uses `instead of` triggers to send a record to the right base table according to the association's system type.
A set of human-friendly views is available to help inspect base tables without being confronted by an ocean of UUIDs. The views have the name of the base table suffixed by `__view` (e.g. `entity__view`).
The views are intended to help software and data engineers explore and understand the database; it is not anticipated that these will be used to deliver data to clients.
These views substitute UUIDs with the default label (or equivalent) of the related 'thing' (e.g. a class, attribute or entity identifier). Where sensible, associated classes or entities may be provided as arrays of default labels.
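As an illustration, such a view might be built along these lines (the join logic and label columns are assumptions for the purpose of the sketch):

```sql
-- Hypothetical human-friendly view: swap UUIDs for default labels.
CREATE OR REPLACE VIEW data.entity__view AS
SELECT e.id,
       c.default_label AS class,  -- class label rather than class_id UUID
       e.default_label AS entity  -- the entity's own default label
FROM data.entity e
JOIN cm.class c ON c.id = e.class_id;
```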