Define DataLad metadata model #56

Closed
mih opened this issue Jul 11, 2023 · 5 comments


@mih
Contributor

mih commented Jul 11, 2023

This is closely linked to #37 and #55.

We urgently need to define a comprehensive (meta)data model that captures the various basic entities we work with. Some of these are already covered in http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html, e.g. Dataset, DatasetVersion. Things like DataDistribution do not seem to have a clear applicability in the DataLad context.

What is critically needed is a model to describe a subdataset or Git submodule: A repository/branch state that is employed/imported/mounted in the context of a particular superdataset, at a particular location.

It seems necessary that this entity be distinct from Dataset or DatasetVersion, because neither of those could contain the necessary information: they can be used in multiple contexts in bit-identical form.
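To make the distinction concrete, here is a minimal sketch of the idea, not a proposed schema. The class name `SubdatasetMount`, all field names, and all identifiers and commit values are hypothetical and chosen for illustration only. It shows one bit-identical dataset version being used in two superdatasets, with the context (which superdataset, which location) living in a separate entity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """A repository state; identical no matter where it is used."""
    dataset_id: str
    commit: str


@dataclass(frozen=True)
class SubdatasetMount:
    """Hypothetical entity: a version mounted at a path in a superdataset version."""
    superdataset: DatasetVersion
    path: str
    subdataset: DatasetVersion


# Made-up identifiers and commits, for illustration only.
shared = DatasetVersion(dataset_id="ds-shared", commit="a1b2c3")
super_a = DatasetVersion(dataset_id="ds-super-a", commit="d4e5f6")
super_b = DatasetVersion(dataset_id="ds-super-b", commit="0719aa")

mounts = [
    SubdatasetMount(super_a, "inputs/raw", shared),
    SubdatasetMount(super_b, "code/pipeline", shared),
]

# The same version appears in both contexts ...
assert mounts[0].subdataset == mounts[1].subdataset
# ... but the two mounts are different entities.
assert mounts[0] != mounts[1]
```

Under these assumptions, the version record carries no knowledge of where it is mounted, which is why the location has to be attached to something else.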

Maybe there is an ontology of Git concepts already?

@christian-monch
Contributor

We can take a look at W3C's Provenance ontology (https://www.w3.org/TR/prov-o/) for some ideas w.r.t. versioned dataset repositories, i.e. datalad-datasets, git-repos.

@jsheunis
Contributor

Been searching for something that we could use and found a few interesting links:

They use Wikidata as a source for some git-related entities but I didn't find anything useful for submodule yet. They also use: https://www.semanticarts.com/gist/

@christian-monch
Contributor

christian-monch commented Jul 14, 2023

Started the work on this in the metalad repo (https://github.com/datalad/datalad-metalad/tree/ld-metadata-model). A few remarks (taken from the first commit message in the branch):

The branch currently only adds a Turtle document that should grow into a schema describing the internal metadata structure of metalad.

The current version is far from complete and rather naive; it does not yet take a large number of schemas/ontologies into account.

I did look at the HCLS Dataset Descriptions, and it seems to me that their DatasetVersion is geared too much toward a manually provided version that has been assigned to a specific code version. It includes concepts that are not easily translated into commit-based versions (which are used in metalad), e.g. "previousVersion". Therefore there is, for now, a class called DatasetInstance, which refers to a specific commit in a DataLad dataset. Those are usually much more fine-grained than a version.

The git layer is not modelled. That could be done with https://github.com/justin2004/git_to_rdf (which @jsheunis pointed to).

The intention of the layer described here is to model the metalad specific metadata-elements. The git-related RDF content can be attached to a DatasetInstance.
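As a rough sketch of that layering, and not the actual schema in the branch: the field names, the `attach` method, and the `git:`-prefixed predicates below are all assumptions made for illustration. A DatasetInstance is pinned to one commit, and git-related statements are attached to it as plain subject/predicate/object triples.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A triple is modelled as a plain (subject, predicate, object) tuple of strings.
Triple = Tuple[str, str, str]


@dataclass
class DatasetInstance:
    """One specific commit of a DataLad dataset (field names are assumptions)."""
    dataset_id: str
    commit: str
    attached: List[Triple] = field(default_factory=list)

    def attach(self, triples: List[Triple]) -> None:
        """Attach externally produced triples, e.g. from a git-to-RDF tool."""
        self.attached.extend(triples)


instance = DatasetInstance(dataset_id="ds-example", commit="a1b2c3")

# Hypothetical git-layer statements; the predicate names are invented here.
git_layer = [
    ("git:commit/a1b2c3", "git:hypotheticalParent", "git:commit/9f8e7d"),
    ("git:commit/a1b2c3", "git:hypotheticalAuthor", "ex:someone"),
]
instance.attach(git_layer)
```

The point of keeping the attached triples opaque is that the metalad layer would not need to know the git vocabulary at all.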

@christian-monch
Contributor

The RDF schema definition includes an example instance, i.e. ex:exampleInstance_1, to allow playing around with SPARQL.

The following query, for example, yields all metadata formats, i.e. extractor_names, that are present in the graph:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix ex: <http://www.example.com/>
prefix dl: <http://datalad.org/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
Select ?format where {
   ?s dl:metadataFormat ?format .
}

The output would be something like:

format
------
metalad_core
studyminimeta
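For readers without a SPARQL engine at hand, the triple pattern in that query can be mimicked in plain Python. This is only a sketch of what the pattern `?s dl:metadataFormat ?format` selects. It is not how metalad or any RDF library evaluates queries, and the subjects `ex:record_1` and `ex:record_2`, the class `ex:HypotheticalClass`, and the helper `select_objects` are made up for illustration.

```python
# A tiny "graph" as a list of (subject, predicate, object) tuples.
# Subjects and the rdf:type object are hypothetical placeholders.
triples = [
    ("ex:exampleInstance_1", "rdf:type", "ex:HypotheticalClass"),
    ("ex:record_1", "dl:metadataFormat", "metalad_core"),
    ("ex:record_2", "dl:metadataFormat", "studyminimeta"),
]


def select_objects(graph, predicate):
    """Return the object of every triple whose predicate matches, in graph order."""
    return [o for (_s, p, o) in graph if p == predicate]


formats = select_objects(triples, "dl:metadataFormat")
for fmt in formats:
    print(fmt)
```

Like a SELECT without DISTINCT, this keeps duplicates if several subjects share the same format.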

@mih
Contributor Author

mih commented Jul 21, 2023

From my POV the essence of this issue is described and demo'ed in datalad/datalad-metalad#389 (comment)

Given that this key topic is not tabby-specific at all, I will close this issue here.

mih closed this as completed Jul 21, 2023