Data and Values

Damion Dooley edited this page Oct 22, 2018 · 99 revisions

Introduction

OBI has adopted IAO's Information Content Entity (ICE), very generally defined as a type of entity which bears information about something [1]. Any ICE class may have 'is about' object relations to other entities, which define its aboutness; for example, "minimal inhibitory concentration 'is about' some dose response curve" and "age since planting measurement datum 'is about' some Spermatophyta" (among other things). If the target of an 'is about' relation is a quality, then a sub-property of 'is about' called 'is quality measurement of' may be used, for example "length measurement datum 'is quality measurement of' some length".

The Information Content Entity data item class holds singular entities or collections of entities that specifically record inputs (parameters) or outputs (measurement/prediction/transformation datums) of processes/equipment/human interaction. To establish some constraints on what a singular data item (datum) can have for a value, OBI introduces an ICE subclass called value specification (VS) which can express those constraints in axioms (for example string length, pertinent units, or valid categorical choices). An instance of a value specification can have a has specified value data property that holds its literal value.

[diagram]

OBI also has a measurement datum class intended to name the datum outputs of an assay, and their contextual semantics while avoiding the value specification level of detail. Measurement datums and value specifications are independent classes, with measurement datums (or any ICE term) allowed a parallel value specification via the has value specification object property. (Past discussion on these two classes is at #870, #945 and #833). Thus an age measurement datum would be linked to a decimal number value specification which could detail the unit - year, month, day etc. of the measure.

Assays that result in a particular value specification, such as age in years, may achieve this through contextually different measurement datums. For example, the age since planting measurement datum could have more specific assays that calculate or estimate age by tree ring count, carbon-14 analysis, planting date, height of species, etc. A measurement datum is, of course, enriched by 'output of' relations to the process(es) that generated it, which also give it a context that has implications for what it is about.

A value specification's primary aboutness is expressed using the 'specifies value of' object relation, for example "mass value specification subClassOf 'specifies value of' some mass". Where a value specification is about the conjunction of a few different things, the aboutness target can be a precomposed term or an instance of those components, with one component providing the primary type of the measure. For example, "eye color" is primarily about color (and so limited to the ways color can be reported), secondarily about the body part being observed, and finally, in the instance, references a particular organism being observed. An example value specification instance: "_vs1 'specifies value of' some color and 'is quality of' eye and 'is about' Patient123".

[diagram]

However, the defining characteristic of an Information Content Entity is that it is 'about' something. Thus, the scope of the OBI representation of data is to capture the details and characteristics of the information, rather than the thing that it describes. This is a crucial scoping step in developing representations in OBI.

Basic Implementation Issues: RDF, OWL, etc.

OWL inherits most of RDF's ability to specify XML Schema string, numeric, datetime, and URI datatype values as data properties of an entity, and can compare data properties across entities (see here and here). OWL can also constrain string length and content, and can place numeric bounds on numbers. OBI currently focuses on reusing these XML Schema datatypes to capture experimental data. Those who need further functionality may find other datatype representations useful (e.g. here).

In addition to reasoning prowess, using an OWL ontology to detail types of assay data - parameters, measurables, independent and dependent variables - will encourage standardization of their usage, enable experimental reproducibility, and facilitate data exchange and conversion.

Value specifications vs data properties

One advantage of having value specifications is that they reduce the need for a plethora of data properties. Rather than establish a 'has age' data property, we express a value specification about age. Both hold a value, but the latter allows us to focus on defining the semantics of the quality 'age' and its subclasses - 'age since planting' etc. (In this view a data property is analogous to a kind of compressed and semantically opaque value specification because its semantic detail is limited to data property attributes.)

[diagram]
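
As a rough illustration of this contrast, the sketch below models both styles in Python. The `ValueSpecification` dataclass and its field names are hypothetical stand-ins that mirror OBI relation labels; they are not part of OBI or of any library.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Literal types a 'has specified value' data property might hold (assumption).
Literal = Union[str, int, float, bool]


@dataclass
class ValueSpecification:
    """Hypothetical stand-in for an instance of an OBI value specification."""
    specifies_value_of: str                          # what the value is about
    has_specified_value: Optional[Literal] = None    # the literal itself
    has_measurement_unit_label: Optional[str] = None


# Data-property style: the meaning lives only in the property name.
tree_as_data_property = {"has age": 12.5}

# Value-specification style: the quality and unit are named explicitly and
# can be refined (e.g. 'age since planting') without minting a new property.
tree_age = ValueSpecification(
    specifies_value_of="age since planting",
    has_specified_value=12.5,
    has_measurement_unit_label="year",
)
```

Both forms hold the same literal; only the second says what kind of age it is and in which unit.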

Data Types

This section discusses a handful of the primitive RDF datatypes that are used in OWL. User-defined datatypes are avoided in favour of enhanced value specifications that do the same work. Some examples are expressed directly in OBI, while others can be constructed in an application ontology that draws on OBI components. Recognizing that OWL isn't suitable for all types of validation, we show how value specifications can be enhanced with basic numeric range and string content restrictions.

Each of the datatypes below is described for use in the singular case; a collection of datums of a given data type is called a data set, which if numeric is amenable to statistical calculations, like a numeric spreadsheet column. It is appropriate to connect value specification instances to such a data set using the member of relation.

String

An OWL data property can hold a string as a plain literal with an optional language tag (see here). Constraints can be placed on the string's length and on its contents (by way of regular expressions).

For example, a US ZIP Code is a string of 5 digits (stored as a string to anticipate compatibility with its ZIP+4 extension). One could construct the following representation:

Class: 'postal code specification'
    subClassOf 'value specification'
    subClassOf 'has specified value' only xsd:string[pattern "[0-9A-Za-z \-]{2,10}"]

Class: 'ZIP code specification'
    subClassOf 'postal code specification'
    subClassOf 'specifies value of' some ('postal code' and 'is about' some (site and 'located in' some 'United States of America'))
    subClassOf 'has specified value' only xsd:string[pattern "[0-9]{5}"]

[diagram]
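
Outside a reasoner, the same two patterns can be checked with an ordinary regular-expression library. The Python sketch below assumes XML Schema's whole-string matching (hence `fullmatch`) and that these simple patterns behave the same way in Python's `re` module; the function names are illustrative only.

```python
import re

# Patterns copied from the 'has specified value' axioms above.
POSTAL_CODE_PATTERN = re.compile(r"[0-9A-Za-z \-]{2,10}")
ZIP_CODE_PATTERN = re.compile(r"[0-9]{5}")


def is_postal_code(value: str) -> bool:
    """True if the whole string satisfies the general postal code pattern."""
    return POSTAL_CODE_PATTERN.fullmatch(value) is not None


def is_zip_code(value: str) -> bool:
    """A ZIP code is a subclass, so it must satisfy both restrictions."""
    return is_postal_code(value) and ZIP_CODE_PATTERN.fullmatch(value) is not None
```

For instance, `is_zip_code("90210")` holds, while `"K1A 0B1"` is accepted only as a general postal code.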

String length constraints can be set via the "length", "minLength" and "maxLength" parameters, e.g. xsd:string[length "5"^^xsd:integer]. A "pattern" parameter supports regular expression syntax to some extent, allowing "[0-9] [a-z] [A-Z] . ? * + {m,n}" components. Thus we can express fairly well-validated email addresses:

Class: 'email address specification'
    subClassOf 'value specification'
    subClassOf 'specifies value of' only 'email address' 
    subClassOf 'has specified value' only xsd:string[pattern "[A-Za-z0-9]+([_.\-][A-Za-z0-9]+)*\@[A-Za-z0-9]+([.\-][A-Za-z0-9]+){1,3}"]

Note one quirk: in pattern matching, the "@" character must be escaped or else the remainder of the test string is ignored (i.e. "@" is interpreted as the start of a language tag appended to the string). More work is also required to cover validation of international / UTF-8 strings.
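
The length facets and the email pattern can be mimicked in Python as a quick sanity check before loading data. This is a sketch only: `has_length` and `is_email` are hypothetical helper names, and the "@" is left unescaped here because the escaping quirk applies to the OWL axiom, not to Python's `re` module.

```python
import re
from typing import Optional

# Same pattern as the 'email address specification' axiom, minus the "\@" escape.
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9]+([_.\-][A-Za-z0-9]+)*@[A-Za-z0-9]+([.\-][A-Za-z0-9]+){1,3}"
)


def has_length(value: str,
               length: Optional[int] = None,
               min_length: Optional[int] = None,
               max_length: Optional[int] = None) -> bool:
    """Mimic the 'length', 'minLength' and 'maxLength' facets (in characters)."""
    n = len(value)
    if length is not None and n != length:
        return False
    if min_length is not None and n < min_length:
        return False
    if max_length is not None and n > max_length:
        return False
    return True


def is_email(value: str) -> bool:
    """True if the whole string matches the email pattern."""
    return EMAIL_PATTERN.fullmatch(value) is not None
```

The pattern accepts `jane.doe@example.org` but rejects `jane@localhost`, since at least one dot- or hyphen-separated domain part is required.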

Categorical

A categorical value specification is a flat list or hierarchic tree structure containing a finite number of pre-determined choices. Here we provide for choices whose values are either xsd:string or xsd:anyURI references to ontology terms.

Categorical string choice

If a string must conform to a smaller set of choices, and nothing more needs to be axiomatized about each choice, then this can be accomplished with a value specification that is both string and categorical. The value specification has a 'has specified value' component which uses a regular expression to enumerate the permitted strings. Note that with this approach one cannot easily provide other information (label, description) about each choice in a user interface.

For example, an "E-coli K antigen value specification" can be represented as:

Class: 'E-coli K antigen value specification'
    subClassOf 'categorical value specification'
    subClassOf 'specifies value of' only 'K antigen'
    subClassOf 'has specified value' only xsd:string[pattern "K(1|2a|2ac|3|4|5|6|7|8|9|10|11|12|13|14|15|16|18a|18ab|19|20|22|23|24|26|27|28|29|30|31|34|37|39|40|41|42|43|44|45|46|47|49|50|51|52|53|54|56|96|55|74|82|84|85ab|85ac|87|92|93|95|97|98|100|101|102|103|X104|X105|X106)"]

[diagram]

This allows a reasoner to flag an inconsistency when an instance of 'E-coli K antigen value specification' has the specified value 'K17a', which is not among the enumerated choices.

One can potentially leave the has specified value axiom out, in which case validation enforcement would need to occur outside the OWL reasoning context.
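
If validation is moved outside the reasoner, the same enumeration can be applied directly to incoming strings. The sketch below reuses the pattern from the axiom above and assumes Python's `re` module with whole-string matching; `is_k_antigen` is a hypothetical helper name.

```python
import re

# Enumeration copied from the 'E-coli K antigen value specification' axiom.
K_ANTIGEN_PATTERN = re.compile(
    r"K(1|2a|2ac|3|4|5|6|7|8|9|10|11|12|13|14|15|16|18a|18ab|19|20|22|23|24"
    r"|26|27|28|29|30|31|34|37|39|40|41|42|43|44|45|46|47|49|50|51|52|53|54"
    r"|56|96|55|74|82|84|85ab|85ac|87|92|93|95|97|98|100|101|102|103"
    r"|X104|X105|X106)"
)


def is_k_antigen(value: str) -> bool:
    """True only if the whole string is one of the enumerated K antigens."""
    return K_ANTIGEN_PATTERN.fullmatch(value) is not None
```

Here `is_k_antigen("K18ab")` holds, whereas `is_k_antigen("K17a")` does not, mirroring the case the reasoner would reject.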

Categorical ontology term choice

Categorical choice lists or trees of ontology terms (e.g. of organism taxonomy, of disease, etc.) essentially have an xsd:anyURI datatype since a selection is an ontology URI. The aim here is to point to existing ontology class or instance identifiers within one's application ontology and/or imported from 3rd party ontologies as selections for a categorical variable. However, some complications arise which the following example will explore. We could try to capture a handedness quality with:

Class: 'handedness value specification'
    subClassOf 'categorical value specification'
    subClassOf 'has specified value' only handedness 

However, this is not permitted in OWL, since the 'has specified value' data property can only have a literal on the right side. The target could be expressed simply as "'has specified value' only xsd:anyURI", but this then requires some other mechanism for validating categorical values. Instead, let's reformulate this using 'specifies value of':

Class: 'handedness value specification'
    subClassOf 'categorical value specification'
    subClassOf 'specifies value of' only handedness

Now an instance of handedness value specification can have a specifies value of axiom pointing to a handedness class instance. Awkwardly, this requires all handedness selections to be "punned" since they can't be referenced directly as classes. In other words an individual needs to be created to mirror each categorical choice, so for example classes for left handedness, right handedness, ambidextrous handedness all need mirrored individuals - and in this case these are not native to the PATO ontology that the classes originate from. (Punning is accomplished manually in Protege by copying an existing class URI into the "Create a new Named individual" form, with the "new entity options ..." set to expect a user supplied name. This preserves the same identifier for both class and individual).
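
The effect of punning can be pictured as a small set of triples in which the same identifier is declared both as a class and as a named individual. The Python sketch below uses plain tuples rather than an RDF library; the `http://example.org/...` identifiers are placeholders (not real PATO IRIs), and typing each punned individual as an instance of the parent handedness class is an assumption about how the "only handedness" restriction would be satisfied.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
OWL_CLASS = "owl:Class"
OWL_NAMED_INDIVIDUAL = "owl:NamedIndividual"


def pun(class_iri: str, parent_iri: str) -> List[Triple]:
    """Declare class_iri as both a class and an individual (same identifier)."""
    return [
        (class_iri, RDF_TYPE, OWL_CLASS),
        (class_iri, RDF_TYPE, OWL_NAMED_INDIVIDUAL),
        # Assumption: the mirrored individual is an instance of the parent class.
        (class_iri, RDF_TYPE, parent_iri),
    ]


# Placeholder identifiers for illustration only.
HANDEDNESS = "http://example.org/handedness"
CHOICES = [
    "http://example.org/left_handedness",
    "http://example.org/right_handedness",
    "http://example.org/ambidextrous_handedness",
]

triples = [t for choice in CHOICES for t in pun(choice, HANDEDNESS)]
```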

An underlying issue under discussion is about the most appropriate location - categorical measurement datum or categorical value specification - for enumerating categorical choices. Previously OBI has focused on categorical measurement datum, with a has category label object property that links to a set or class of permissible terms (as shown in OBI's existing handedness value specification example). A categorical measurement datum instance then points to a choice using has category label in the same way that the 'specifies value of' object property is used. Both relations inextricably refer to what they are about.

In a different approach, an OBI example using categorical value specification focuses on describing a tumor grading standard histologic grade according to AJCC 7th edition. Here the class has individuals which are each interpreted as grades, and which could potentially be augmented with data properties that detail their assessment differentiae. This approach is suited to cases where selections are not already established (and would not be in the future) as ontology classes with their own hierarchic context.

Boolean

Under discussion is the formalization of a "boolean value specification" datatype that pertains to the presence or absence of a quality or categorical entity. Essentially, any quality taken on its own can be treated as a boolean variable. The information that an animal is characterized as a neonate, for example, may be the focus of interest in a study even if a more comprehensive categorical value specification of its developmental stage could have been posed as a Likert scale.

Class: 'neonate value specification'
    subClassOf 'value specification'
    subClassOf 'has specified value' only xsd:boolean
    subClassOf 'specifies value of' only 'neonate' 

Each categorical value specification choice instance can potentially be interpreted as a boolean too.
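
When loading boolean values from source data, it may help to accept only the lexical forms that xsd:boolean allows ("true", "false", "1", "0"). The following Python sketch shows one way to do that; `parse_xsd_boolean` and the `neonate_vs` dictionary are hypothetical names used for illustration.

```python
def parse_xsd_boolean(lexical: str) -> bool:
    """Convert an xsd:boolean lexical form into a Python bool."""
    if lexical in ("true", "1"):
        return True
    if lexical in ("false", "0"):
        return False
    raise ValueError(f"not a valid xsd:boolean lexical form: {lexical!r}")


# A hypothetical 'neonate value specification' instance as a plain dictionary.
neonate_vs = {
    "specifies value of": "neonate",
    "has specified value": parse_xsd_boolean("true"),
}
```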

Ordinal

Numeric

Currently all numeric value specifications are handled under the scalar value specification term, which implies that each must have a unit as well.

The xsd:decimal datatype forms the general basis of the more specific integer datatypes (xsd:float is a separate primitive type in XML Schema); numeric conversion appears to be smooth between these types. Any number type can be paired with a unit as described below.

OBI currently does not provide functionality for dealing with numeric precision or error range.

Decimal

Here the pH acidity scale is effectively characterized as a decimal between 0.0 and 14.0:

Class: 'ph value specification'
    subClassOf 'scalar value specification'
    subClassOf 'has measurement unit label' only 'pH' 
    subClassOf 'specifies value of' only 'pH measurement'
    subClassOf 'has specified value' only xsd:decimal[ >=0, <=14 ]

Note that the Protege axiom editor can be very fussy about exactly how the >,>=,<,<= comparators are positioned with spaces with respect to brackets and numbers.
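
The same inclusive range can be checked outside the ontology editor. The Python sketch below uses `decimal.Decimal` to avoid binary floating-point rounding at the boundaries; `is_valid_ph` is a hypothetical helper, and note that `Decimal` accepts some forms (such as exponent notation) that xsd:decimal does not.

```python
from decimal import Decimal, InvalidOperation

# Bounds taken from the xsd:decimal[ >=0, <=14 ] restriction above.
PH_MIN = Decimal("0")
PH_MAX = Decimal("14")


def is_valid_ph(lexical: str) -> bool:
    """True if the string parses as a finite decimal within [0, 14]."""
    try:
        value = Decimal(lexical)
    except InvalidOperation:
        return False
    if not value.is_finite():  # reject NaN and Infinity before comparing
        return False
    return PH_MIN <= value <= PH_MAX
```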

Integer

Some variables are inherently integers: countable things that can't meaningfully have fractions except in intermediate calculations (quantities of water can be described in decimal to handle portions like 1.5 cups, while basepairs are not meaningful as fractions). Use xsd:integer where rounding during comparison won't be an issue.

Class: 'MIC diffusion measurement specification'
    subClassOf 'scalar value specification'
    subClassOf 'has measurement unit label' only 'millimeter' 
    subClassOf 'specifies value of' only 'MIC value'
    subClassOf 'has specified value' only xsd:integer[ >5 ,< 100]

(OWL actually provides access to further subclasses of integer such as xsd:positiveInteger, but OBI does not have a matching granularity of value specification classes.)

Float

Class: 'MIC dilution measurement specification'
    subClassOf 'scalar value specification'
    subClassOf 'has measurement unit label' only ('milligram per liter' or 'microgram per milliliter')
    subClassOf 'specifies value of' only 'MIC value'
    subClassOf 'has specified value' only xsd:float[ >=0.01f ,<= 2048.0f]
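
The integer example uses exclusive bounds (">", "<") while the float example uses inclusive ones (">=", "<="). A minimal Python sketch of that distinction follows; `NumericRange` and its field names are hypothetical, loosely modelled on the XML Schema facet names, and Python floats are double precision, so comparisons at the boundary may differ from single-precision xsd:float.

```python
from dataclasses import dataclass
from numbers import Real
from typing import Optional


@dataclass(frozen=True)
class NumericRange:
    """Inclusive/exclusive bounds, mirroring the >=, >, <=, < comparators."""
    min_inclusive: Optional[Real] = None
    min_exclusive: Optional[Real] = None
    max_inclusive: Optional[Real] = None
    max_exclusive: Optional[Real] = None

    def contains(self, value: Real) -> bool:
        if self.min_inclusive is not None and value < self.min_inclusive:
            return False
        if self.min_exclusive is not None and value <= self.min_exclusive:
            return False
        if self.max_inclusive is not None and value > self.max_inclusive:
            return False
        if self.max_exclusive is not None and value >= self.max_exclusive:
            return False
        return True


# xsd:integer[ >5 ,< 100]
MIC_DIFFUSION_MM = NumericRange(min_exclusive=5, max_exclusive=100)
# xsd:float[ >=0.01f ,<= 2048.0f]
MIC_DILUTION = NumericRange(min_inclusive=0.01, max_inclusive=2048.0)


def is_valid_diffusion(value: object) -> bool:
    """Integers only: floats and booleans are rejected before the range check."""
    return (isinstance(value, int) and not isinstance(value, bool)
            and MIC_DIFFUSION_MM.contains(value))
```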

Units

OBI uses the 'has measurement unit label' relation to pair numeric scalar parameters with related units. The Units of Measurement Ontology (UO) is the default unit ontology used by the OBO Foundry community, although there are other options [2,3]. It is left to a unit ontology to express the base units of the International System of Units, as well as compound units that have numerators and denominators sufficient for a problem space.

A value specification can declare, at a general level, all the permissible units to which its more specific value specifications and their instances must conform.

Units extend to countable things like nucleotide 'basepairs' and potentially even 'oranges' or 'fruit' etc. In this respect they indicate the aboutness of the value specification.

Duration

A duration is a difference in time calculated from an interval of two time points. (Semantically the interval is about those points and the events they mark). Value specifications for date and time durations or intervals are generally handled by decimal value specifications with one or more time units attached to them. This allows for decimal fraction amounts, e.g. 2.5 days. An 'age since birth' value specification could be:

Class: 'age since birth value specification'
    subClassOf 'scalar value specification'
    subClassOf 'has specified value' only xsd:decimal
    subClassOf 'specifies value of' some 'age since birth'
    subClassOf 'has measurement unit label' only (year or month or day or hour)
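
To illustrate how a decimal duration such as 2.5 days arises from two time points, here is a small Python sketch. The `duration_value` helper and `UNIT_LENGTH` table are hypothetical; only hour and day are covered, because months and years do not have a fixed length and would need an explicit convention.

```python
from datetime import datetime, timedelta
from decimal import Decimal

# Fixed-length units only (assumption: month and year are handled elsewhere).
UNIT_LENGTH = {
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
}


def duration_value(start: datetime, end: datetime, unit: str) -> Decimal:
    """Return the interval between two time points as a decimal count of units."""
    one_microsecond = timedelta(microseconds=1)
    # Integer microsecond counts keep the division exact in Decimal arithmetic.
    elapsed = (end - start) // one_microsecond
    per_unit = UNIT_LENGTH[unit] // one_microsecond
    return Decimal(elapsed) / Decimal(per_unit)


start = datetime(2018, 10, 20, 0, 0, 0)
end = datetime(2018, 10, 22, 12, 0, 0)
in_days = duration_value(start, end, "day")
in_hours = duration_value(start, end, "hour")
```

The resulting number would be stored as the 'has specified value' literal, with the chosen unit recorded via 'has measurement unit label'.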

Datetime

Of XML's native date/time datatypes, OWL has currently adopted xsd:date, xsd:dateTime (format [-]CCYY-MM-DDThh:mm:ss.sss[Z|(+|-)hh:mm] according to the ISO 8601 standard) and xsd:dateTimeStamp (format CCYY-MM-DDThh:mm:ss.sss(Z|(+|-)hh:mm), i.e. time zone required) into its reasoning specification. A Gregorian calendar, 24-hour clock instant of time is used, and will be compared down to the second and timezone offset for the xsd:dateTime and xsd:dateTimeStamp formats.

Class: 'hospital admission date specification'
    subClassOf 'scalar value specification'
    subClassOf 'has specified value' only xsd:date
    subClassOf 'specifies value of' only 'hospital admission date'

A need for date obfuscation often arises when dealing with confidential data points. Pairing the value with a unit such as year, month, day or hour can convey the semantic granularity of a given xsd:dateTime, but this has no effect on a reasoner's equality test, so the remaining datetime components need to be the same (zeroed out, for example) in order for an equality constraint to succeed. This could be done as a pre-processing step.
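
Such a pre-processing step might look like the following Python sketch, which resets every component finer than the chosen granularity. The `truncate` function is a hypothetical helper; month and day are reset to 1 rather than 0, since 0 is not a valid value for either.

```python
from datetime import datetime, timezone

GRANULARITIES = ("year", "month", "day", "hour")


def truncate(dt: datetime, granularity: str) -> datetime:
    """Reset all components finer than the given granularity."""
    if granularity not in GRANULARITIES:
        raise ValueError(f"unsupported granularity: {granularity!r}")
    level = GRANULARITIES.index(granularity)
    return dt.replace(
        month=dt.month if level >= 1 else 1,
        day=dt.day if level >= 2 else 1,
        hour=dt.hour if level >= 3 else 0,
        minute=0,
        second=0,
        microsecond=0,
    )


admitted_a = datetime(2018, 10, 22, 14, 35, 7, tzinfo=timezone.utc)
admitted_b = datetime(2018, 10, 3, 8, 0, 0, tzinfo=timezone.utc)

# Equal at month granularity, still distinct at day granularity.
same_month = truncate(admitted_a, "month") == truncate(admitted_b, "month")
same_day = truncate(admitted_a, "day") == truncate(admitted_b, "day")
```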

If a more complex model of date/time is required, the "Time Ontology in OWL" [4,5] may suffice.

Missing values and other metadata

Data sources likely have a variety of ways to mark missing values. A food database example: “When the content of a food for a component is not known, a hyphen stands in place of the number. It is important for users to take into account these missing values and not to consider them as zero” [6]. Currently, a simple way to express this is to have an instance of a value specification, but no 'has specified value' data property for it.
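
The hyphen convention from the quoted example could be handled at import time along the following lines. This Python sketch is illustrative only: `parse_component` and `mean_of_known` are hypothetical helpers, and `None` stands for a value specification instance that has no 'has specified value' property.

```python
from decimal import Decimal
from typing import Iterable, Optional

MISSING_MARKER = "-"  # the hyphen convention from the quoted food database


def parse_component(raw: str) -> Optional[Decimal]:
    """Return None for a missing value instead of silently treating it as zero."""
    text = raw.strip()
    if text == MISSING_MARKER:
        return None
    return Decimal(text)


def mean_of_known(raw_values: Iterable[str]) -> Optional[Decimal]:
    """Average only the values that are actually present."""
    known = [v for v in map(parse_component, raw_values) if v is not None]
    if not known:
        return None
    return sum(known, Decimal(0)) / len(known)


average = mean_of_known(["12.5", "-", "7.5"])
```

Treating the hyphen as zero would have pulled the average down to about 6.67 instead of 10.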

Other metadata may need to be marked e.g. how to deal with: “In some cases, a component is detected in the food matrix, but it cannot be quantified precisely. The analytical result can therefore be considered as ‘trace’.” Another case is where a data item exists but has been obfuscated for privacy reasons. OBI does not currently have a metadata standard that addresses these cases.


References:

[1] The ability of an ICE to bear information depends on the coding scheme and medium it inheres in, hence it is a generically dependent continuant.

[2] https://github.com/HajoRijgersberg/OM

[3] http://qudt.org/

[4] https://www.w3.org/2001/sw/BestPractices/OEP/Time-Ontology

[5] https://www.w3.org/TR/owl-time/

[6] https://ciqual.anses.fr/cms/sites/default/files/inline-files/TableCiqual2017_XML_docENG.pdf