Provide guidelines on use of metadata record identifiers #389

mih · 2023-07-20T06:19:52Z

Metadata homogenization is a key challenge. It becomes far easier if one and the same thing described by two extractors is identified using the exact same identifier. AFAICS, metalad provides no guidelines on how to achieve that. There is only this issue:

Metadata identifier concept #30

Within the context of https://github.com/psychoinformatics-de/datalad-tabby and https://github.com/datalad/datalad-registry these things are also relevant and are being discussed. Examples:


mih · 2023-07-20T14:26:49Z

I think the metadata extractor base class must provide methods to return valid JSON-LD @id values for

dataset concept (all versions)
dataset version
subdataset concept (all versions)
file version
file-in-dataset concept (all versions)

We do not need

subdataset version: this is the same as dataset version
file-in-dataset version: this is the same as file version (i.e. a content identifier, like an annex key)
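A minimal sketch of what such an interface could look like, assuming an abstract mixin-style class. The class name and method names are hypothetical and are not taken from metalad; the sketch only fixes the five identifiers listed above and leaves their construction to subclasses. The two identifiers that are not needed have no method of their own.

```python
from abc import ABC, abstractmethod


class RecordIdentifiers(ABC):
    """Hypothetical interface, NOT metalad's actual extractor base class."""

    @abstractmethod
    def dataset_concept_id(self) -> str:
        """@id shared by all versions of the dataset."""

    @abstractmethod
    def dataset_version_id(self) -> str:
        """@id of one specific dataset version (also used for subdataset versions)."""

    @abstractmethod
    def subdataset_concept_id(self, relpath: str) -> str:
        """@id shared by all versions of the subdataset at relpath."""

    @abstractmethod
    def file_version_id(self, relpath: str) -> str:
        """@id of the file content at relpath (a content identifier)."""

    @abstractmethod
    def file_in_dataset_concept_id(self, relpath: str) -> str:
        """@id shared by all versions of the file at relpath."""
```

A concrete extractor would subclass this and return fully qualified IRIs or compact IRIs that resolve against its JSON-LD context.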

mih · 2023-07-21T09:05:27Z

I believe I reduced the identifier concept to the minimum complexity. We would need:

a dataset concept identifier: applies to all versions of a dataset
a dataset part concept identifier: applies to all versions of a dataset part, and practically identifies a dataset component by name/location within the namespace of a particular dataset (given by a concept identifier)
a content (version) identifier: applies to any content (version) regardless of the type.

Here is a JSON-LD playground link that shows a fully defined record, with no blank node identifiers, for a dataset version with two files and one subdataset. The playground also defines an example JSON-LD frame that could be used to retrieve a "plain" list of files from such a record.

Here is the record, explanations are given below:

{
  "@context": {
    "dcterms": "https://purl.org/dc/terms/",
    "dlds": "https://dx.datalag.org/dataset/",
    "dldspart": "https://dx.datalad.org/dataset-part/",
    "dlcontent": "https://dx.datalad.org/content/",
    "schema": "https://schema.org/",
    "relpath": "dcterms:identifier",
    "isVersionOf": "dcterms:isVersionOf",
    "hasPart": "dcterms:hasPart"
  },
  "@id": "dlcontent:8646787c089052c639f9f477560c6d16b1f4314d",
  "@type": "schema:Dataset",
  "dcterms:identifier": "mydataset",
  "isVersionOf": "dlds:5604ef1f-377f-436a-a0e4-38257c44473c",
  "hasPart": [
    {
      "@id": "dlcontent:MD5E-s0--d41d8cd98f00b204e9800998ecf8427e",
      "@type": "schema:DigitalDocument",
      "isVersionOf": {
        "@id": "dldspart:4f32c4d7-dae7-58c5-8786-b60d2ebc7826",
        "relpath": "myfile"
      }
    },
    {
      "@id": "dlcontent:MD5E-s190--21f2d4006a8b6bc1a22f2e885d3fbc3a.txt",
      "@type": "schema:DigitalDocument",
      "isVersionOf": {
        "@id": "dldspart:447fbbf7-f732-5fdd-a22d-b3df3bac3e38",
        "relpath": "data/pipe.txt"
      }
    },
    {
      "@id": "dlcontent:ffdbd35dd78986fd3b6d069ca6669f90399b75da",
      "@type": "schema:Dataset",
      "isVersionOf": {
        "@id": "dldspart:ab66dff3-73bf-5aab-bf30-e8cc3f4d7e90",
        "relpath": "sources/myinputs"
      }
    }
  ]
}

The three identifier types are represented in the context definitions: dlds, dldspart, and dlcontent. Each is defined by a possible registry API endpoint (but this is not required for the IRI of such a JSON-LD document).
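To illustrate how the prefixes resolve, here is a simplified sketch of compact-IRI expansion in plain Python. It is not a JSON-LD processor and ignores most of the expansion algorithm; the function name `expand_compact_iri` is made up for this example, and only two prefixes from the record's context are copied in.

```python
# Prefixes copied from the record's @context (only two are needed here).
CONTEXT = {
    "dldspart": "https://dx.datalad.org/dataset-part/",
    "dlcontent": "https://dx.datalad.org/content/",
}


def expand_compact_iri(value: str, context: dict) -> str:
    """Expand 'prefix:suffix' to a full IRI if the prefix is known.

    Simplified illustration only; a real JSON-LD processor handles
    many more cases (term definitions, @vocab, keywords, ...).
    """
    prefix, sep, suffix = value.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix
    return value  # unknown prefix or plain term: leave untouched


print(expand_compact_iri(
    "dlcontent:MD5E-s0--d41d8cd98f00b204e9800998ecf8427e", CONTEXT
))
# prints https://dx.datalad.org/content/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e
```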

The basic principles of this document are:

structurally the versioned entity always wraps the unversioned one
versioned entities use dlcontent identifiers (which could be annex keys or gitshas or something similar)
unversioned/concept entities use UUID identifiers. These are either given in a dataset (i.e., the DataLad dataset ID) or generated from it. For example, the UUID in dldspart:447fbbf7-f732-5fdd-a22d-b3df3bac3e38 is generated via
```
import uuid

# name-based (version 5) UUID in the URL namespace
uuid.uuid5(
  uuid.NAMESPACE_URL,
  'https://dx.datalag.org/dataset/5604ef1f-377f-436a-a0e4-38257c44473c/data/pipe.txt',
)
```
where 5604ef1f-377f-436a-a0e4-38257c44473c is the DataLad dataset ID of the containing dataset.
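The same derivation can be wrapped in a small helper. This is a sketch under assumptions: the helper name `part_concept_uuid` is hypothetical, the name string is assumed to be the dataset IRI joined with the POSIX relative path, and the dataset IRI below uses a placeholder host, so the resulting UUIDs are not expected to match the ones in the record.

```python
import uuid


def part_concept_uuid(dataset_iri: str, relpath: str) -> uuid.UUID:
    """Derive a name-based (version 5) UUID for a dataset part.

    Assumption: the name is the dataset IRI joined with the POSIX
    relative path of the part inside that dataset.
    """
    name = f"{dataset_iri.rstrip('/')}/{relpath}"
    return uuid.uuid5(uuid.NAMESPACE_URL, name)


# Placeholder dataset IRI, for illustration only.
ds = "https://example.org/dataset/5604ef1f-377f-436a-a0e4-38257c44473c"

a = part_concept_uuid(ds, "data/pipe.txt")
b = part_concept_uuid(ds, "data/pipe.txt")
c = part_concept_uuid(ds, "myfile")

# Same dataset and path always give the same UUID; other paths differ.
print(a == b, a != c, a.version)
# prints True True 5
```

Because the UUID is derived purely from the dataset identifier and the relative path, two extractors can arrive at the same part identifier without coordinating through a registry.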

(NB: the property name relpath and its definition as dcterms:identifier are not meant to be final in any way. The specific semantics of this choice are not important for the identifier concept or for the structure of the documents that use them.)
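The JSON-LD frame from the playground is not reproduced here. As a plain-Python stand-in (not a JSON-LD framing implementation), the following sketch walks a compacted record and returns a flat list of files. It assumes the provisional property names used in the record (`hasPart`, `isVersionOf`, `relpath`); the function name `list_files` is made up for this example.

```python
def list_files(record: dict) -> list:
    """Return (relpath, content @id) pairs for all file parts of a record."""
    files = []
    for part in record.get("hasPart", []):
        if part.get("@type") != "schema:DigitalDocument":
            continue  # skip subdatasets and any other part types
        # The versioned entity wraps the unversioned one, which carries the path.
        files.append((part["isVersionOf"]["relpath"], part["@id"]))
    return files


# Abbreviated copy of the record shown earlier in this comment.
record = {
    "@id": "dlcontent:8646787c089052c639f9f477560c6d16b1f4314d",
    "@type": "schema:Dataset",
    "hasPart": [
        {
            "@id": "dlcontent:MD5E-s0--d41d8cd98f00b204e9800998ecf8427e",
            "@type": "schema:DigitalDocument",
            "isVersionOf": {
                "@id": "dldspart:4f32c4d7-dae7-58c5-8786-b60d2ebc7826",
                "relpath": "myfile",
            },
        },
        {
            "@id": "dlcontent:MD5E-s190--21f2d4006a8b6bc1a22f2e885d3fbc3a.txt",
            "@type": "schema:DigitalDocument",
            "isVersionOf": {
                "@id": "dldspart:447fbbf7-f732-5fdd-a22d-b3df3bac3e38",
                "relpath": "data/pipe.txt",
            },
        },
        {
            "@id": "dlcontent:ffdbd35dd78986fd3b6d069ca6669f90399b75da",
            "@type": "schema:Dataset",
            "isVersionOf": {
                "@id": "dldspart:ab66dff3-73bf-5aab-bf30-e8cc3f4d7e90",
                "relpath": "sources/myinputs",
            },
        },
    ],
}

for relpath, content_id in list_files(record):
    print(relpath, content_id)
```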

Post-publication thoughts:

There should be a dcterms:identifier property for dlds:5604ef1f-377f-436a-a0e4-38257c44473c. It should read:

"isVersionOf": {
  "@id": "dlds:5604ef1f-377f-436a-a0e4-38257c44473c",
  "dcterms:identifier": "5604ef1f-377f-436a-a0e4-38257c44473c"
}
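A short sketch of how an existing record could be upgraded to this form, assuming the dataset concept is given as a compact `dlds:` IRI string as in the record above. The function name `with_concept_identifier` is hypothetical.

```python
def with_concept_identifier(record: dict) -> dict:
    """Replace a plain 'dlds:<uuid>' reference with a node that also
    carries the UUID as dcterms:identifier."""
    concept = record.get("isVersionOf")
    if isinstance(concept, str) and concept.startswith("dlds:"):
        updated = dict(record)  # shallow copy; the input is left untouched
        updated["isVersionOf"] = {
            "@id": concept,
            "dcterms:identifier": concept[len("dlds:"):],
        }
        return updated
    return record  # already a node, or not a dlds reference


before = {"isVersionOf": "dlds:5604ef1f-377f-436a-a0e4-38257c44473c"}
print(with_concept_identifier(before))
```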

mih · 2023-07-27T11:36:49Z

http://docs.datalad.org/projects/tabby/en/latest/conventions/tby-ds1.html has a demo for a non-DataLad dataset description that is compatible with this approach, albeit using a slightly better semantic setup (see, for example, the linkage of a POSIX path as a name to a versioned file entity).

mih mentioned this issue Jul 29, 2023

Dataset serialization to tabby-records, and deserialization of tabby-records to datasets psychoinformatics-de/datalad-tabby#48
