Provide guidelines on use of metadata record identifiers #389

Open
mih opened this issue Jul 20, 2023 · 3 comments

mih commented Jul 20, 2023

Metadata homogenization is a key challenge. It is made astronomically easier if one and the same thing, when described by two extractors, is identified using the exact same identifier. AFAICS metalad provides no guidelines on how to achieve that. Only this issue:

Within the context of https://github.com/psychoinformatics-de/datalad-tabby and https://github.com/datalad/datalad-registry these things are also relevant and are being discussed. Examples:


mih commented Jul 20, 2023

I think the metadata extractor base class must provide methods to return valid JSON-LD @id values for

  • dataset concept (all versions)
  • dataset version
  • subdataset concept (all versions)
  • file version
  • file-in-dataset concept (all versions)

We do not need

  • subdataset version: this is the same as dataset version
  • file-in-dataset version: this is the same as file version (i.e. a content identifier, like an annex key)
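
A minimal sketch of what such an interface could look like, assuming an abstract Python base class. The class name `RecordIdProvider` and all method names are hypothetical and are not part of the existing MetaLad API; the sketch only mirrors the five identifier kinds listed above and deliberately omits the two that are declared unnecessary.

```python
from abc import ABC, abstractmethod


class RecordIdProvider(ABC):
    """Hypothetical interface: one method per required JSON-LD @id kind."""

    @abstractmethod
    def dataset_concept_id(self) -> str:
        """@id covering all versions of the dataset."""

    @abstractmethod
    def dataset_version_id(self) -> str:
        """@id of one particular dataset version."""

    @abstractmethod
    def subdataset_concept_id(self, relpath: str) -> str:
        """@id covering all versions of the subdataset at `relpath`."""

    @abstractmethod
    def file_version_id(self, relpath: str) -> str:
        """Content identifier of the file at `relpath` (e.g. an annex key)."""

    @abstractmethod
    def file_concept_id(self, relpath: str) -> str:
        """@id covering all versions of the file at `relpath` in this dataset."""

    # No subdataset-version or file-in-dataset-version methods:
    # they would duplicate dataset_version_id() and file_version_id().
```

An extractor would subclass this and implement all five methods; instantiating a subclass that misses one fails early with a `TypeError`.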


mih commented Jul 21, 2023

I believe I reduced the identifier concept to the minimum complexity. We would need:

  • a dataset concept identifier: applies to all versions of a dataset
  • a dataset part concept identifier: applies to all versions of a dataset part, and practically identifies a dataset component by name/location within the namespace of a particular dataset (given by a concept identifier)
  • a content (version) identifier: applies to any content (version) regardless of the type.

Here is a JSON-LD playground link that shows a fully defined record, without any blank node identifiers, for a dataset version with two files and one subdataset. The playground also defines an example JSON-LD frame that could be used to retrieve a "plain" list of files from such a record.

Here is the record, explanations are given below:

{
  "@context": {
    "dcterms": "https://purl.org/dc/terms/",
    "dlds": "https://dx.datalad.org/dataset/",
    "dldspart": "https://dx.datalad.org/dataset-part/",
    "dlcontent": "https://dx.datalad.org/content/",
    "schema": "https://schema.org/",
    "relpath": "dcterms:identifier",
    "isVersionOf": "dcterms:isVersionOf",
    "hasPart": "dcterms:hasPart"
  },
  "@id": "dlcontent:8646787c089052c639f9f477560c6d16b1f4314d",
  "@type": "schema:Dataset",
  "dcterms:identifier": "mydataset",
  "isVersionOf": "dlds:5604ef1f-377f-436a-a0e4-38257c44473c",
  "hasPart": [
    {
      "@id": "dlcontent:MD5E-s0--d41d8cd98f00b204e9800998ecf8427e",
      "@type": "schema:DigitalDocument",
      "isVersionOf": {
        "@id": "dldspart:4f32c4d7-dae7-58c5-8786-b60d2ebc7826",
        "relpath": "myfile"
      }
    },
    {
      "@id": "dlcontent:MD5E-s190--21f2d4006a8b6bc1a22f2e885d3fbc3a.txt",
      "@type": "schema:DigitalDocument",
      "isVersionOf": {
        "@id": "dldspart:447fbbf7-f732-5fdd-a22d-b3df3bac3e38",
        "relpath": "data/pipe.txt"
      }
    },
    {
      "@id": "dlcontent:ffdbd35dd78986fd3b6d069ca6669f90399b75da",
      "@type": "schema:Dataset",
      "isVersionOf": {
        "@id": "dldspart:ab66dff3-73bf-5aab-bf30-e8cc3f4d7e90",
        "relpath": "sources/myinputs"
      }
    }
  ]
}

The three identifier types are represented in the context definitions: dlds, dldspart, and dlcontent. Each is defined by a possible registry API endpoint (but this is not required for the IRI of such a JSON-LD document).
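
To illustrate how the nesting can be consumed, here is a rough sketch that pulls a plain file list out of a trimmed copy of the record above. It is an assumption-laden shortcut: it walks the compact form with plain dictionary access and relies on the exact key names used in this record, whereas a real consumer would more likely apply a JSON-LD frame with a JSON-LD processor. The function name `list_files` is made up for this example.

```python
# Trimmed copy of the record above: one file and one subdataset.
record = {
    "@id": "dlcontent:8646787c089052c639f9f477560c6d16b1f4314d",
    "@type": "schema:Dataset",
    "hasPart": [
        {
            "@id": "dlcontent:MD5E-s190--21f2d4006a8b6bc1a22f2e885d3fbc3a.txt",
            "@type": "schema:DigitalDocument",
            "isVersionOf": {
                "@id": "dldspart:447fbbf7-f732-5fdd-a22d-b3df3bac3e38",
                "relpath": "data/pipe.txt",
            },
        },
        {
            "@id": "dlcontent:ffdbd35dd78986fd3b6d069ca6669f90399b75da",
            "@type": "schema:Dataset",
            "isVersionOf": {
                "@id": "dldspart:ab66dff3-73bf-5aab-bf30-e8cc3f4d7e90",
                "relpath": "sources/myinputs",
            },
        },
    ],
}


def list_files(rec):
    """Return (relpath, content id) pairs for all file parts of a record."""
    files = []
    for part in rec.get("hasPart", []):
        # Subdatasets are typed schema:Dataset and are skipped here.
        if part.get("@type") != "schema:DigitalDocument":
            continue
        concept = part["isVersionOf"]  # the unversioned entity, wrapped by the version
        files.append((concept["relpath"], part["@id"]))
    return files


files = list_files(record)
```

The versioned entity (`@id` with a `dlcontent:` prefix) is the outer object, and the path lives on the wrapped concept entity, which is why the function reads `relpath` from `isVersionOf`.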

The basic principles of this document are:

  • structurally the versioned entity always wraps the unversioned one

  • versioned entities use dlcontent identifiers (which could be annex keys or gitshas or something similar)

  • unversioned/concept entities use UUID identifiers. These are either given in a dataset (i.e. the datalad dataset ID), or generated from it. For example, the UUID in dldspart:447fbbf7-f732-5fdd-a22d-b3df3bac3e38 is generated via

    uuid.uuid5(
      uuid.NAMESPACE_URL, 
      'https://dx.datalag.org/dataset/5604ef1f-377f-436a-a0e4-38257c44473c/data/pipe.txt',
    )

    where 5604ef1f-377f-436a-a0e4-38257c44473c is the datalad dataset ID of the containing dataset.
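
The derivation above could be wrapped in a small helper along these lines. This is only a sketch: the function name `part_concept_uuid` and the `base` parameter are hypothetical. The base URL is passed in explicitly because a UUIDv5 depends on every character of the name string, so reproducing a given identifier requires exactly the same base that was used when it was first generated.

```python
import uuid


def part_concept_uuid(dataset_id: str, relpath: str, base: str) -> uuid.UUID:
    """Derive a stable concept UUID for a dataset part.

    `base` is the URL prefix placed in front of the dataset ID; it must
    end with a slash. `relpath` is the POSIX path of the part within
    the dataset.
    """
    name = f"{base}{dataset_id}/{relpath}"
    return uuid.uuid5(uuid.NAMESPACE_URL, name)


# The base URL below is a placeholder chosen for this example only.
example = part_concept_uuid(
    "5604ef1f-377f-436a-a0e4-38257c44473c",
    "data/pipe.txt",
    base="https://example.org/dataset/",
)
```

Because UUIDv5 is a deterministic hash of namespace and name, any two extractors that agree on the base URL, dataset ID, and relative path arrive at the same identifier without coordinating through a registry.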

(NB: the property name relpath and its definition as dcterms:identifier are not meant to be final in any way. The specific semantics of this choice are not important for the identifier concept or the structure of the documents that use them.)

Post-publication thoughts:

  • There should be a dcterms:identifier property for dlds:5604ef1f-377f-436a-a0e4-38257c44473c. It should say:

    "isVersionOf": {
      "@id": "dlds:5604ef1f-377f-436a-a0e4-38257c44473c",
      "dcterms:identifier": "5604ef1f-377f-436a-a0e4-38257c44473c"
    }


mih commented Jul 27, 2023

http://docs.datalad.org/projects/tabby/en/latest/conventions/tby-ds1.html has a demo for a non-DataLad dataset description that is compatible with this approach, albeit using a slightly better semantic setup (see for example the linkage of a POSIX path as a name to a versioned file entity).
