Define `@id` for dataset content: `File`(`Version`) and `Subdataset`(`Version`) #78

mih · 2023-07-19T07:49:23Z

Ping:

This is the content-focused companion of #76. We need concept-level and version-level IDs (and possibly even distribution-level IDs) for files and subdatasets too.

In the following, <dsid> stands for a DataLad dataset UUID, which serves as the concept-level identifier of a dataset.

A file can be:

<dsid>/README: "the README of this dataset", i.e. the concept of a file with this name within the scope of that particular dataset
<checksum/annexid>: a particular content blob, which could be declared isVersionOf a "file concept" within a particular dataset

A distribution-level file description could be

the association of a location/download-method with a content blob -- however, this could also just be a plain property of that blob's metadata record
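The two file-level IDs and their relation can be sketched as follows. This is only an illustration of the proposal, not an agreed format: the function names, the example UUID, the example annex key, and the choice of a plain `url` property for the distribution-level information are all assumptions.

```python
from posixpath import normpath


def file_concept_id(dsid: str, relpath: str) -> str:
    """Concept-level ID: a file path scoped by the dataset UUID."""
    # Normalize so that "./README" and "README" yield the same ID.
    return f"{dsid}/{normpath(relpath)}"


def file_version_record(content_id, dsid, relpath, url=None):
    """Version-level record: a content blob linked to its file concept.

    `content_id` is a checksum or annex key (hypothetical example below).
    `url` is optional and models the distribution-level information as a
    plain property of the blob's record (an assumption of this sketch).
    """
    record = {
        "@id": content_id,
        "isVersionOf": file_concept_id(dsid, relpath),
    }
    if url is not None:
        record["url"] = url
    return record


# Hypothetical dataset UUID and annex key, for illustration only.
DSID = "11111111-2222-3333-4444-555555555555"
readme_v1 = file_version_record("MD5E-s42--0123456789abcdef", DSID, "README")
```

Keeping the download location as a property avoids minting a third, distribution-level ID, at the cost of not being able to describe several distributions of the same blob separately.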

For subdatasets the situation appears to be slightly more complex. The technical vehicle of a submodule is composed of:

name: akin to schema:name
path: the mountpoint, and in some sense an ID component identifying a dataset as a subdataset within the scope of a superdataset
url: akin to schema:url
commit: akin to schema:version

So we always have a version-level description here. But the described version cannot be a sufficient identifier, because we need to describe the use of that dataset version as a subdataset (the same version can be used in many different superdatasets, with unique path values and possibly other properties).

At the very minimum, we need to reflect this with a unique @id that cannot just be a gitsha (or <dsid>/<gitsha>, see datalad/datalad-registry#217 (comment)).

We could use the same ID format as for files (after all, files in a dataset and subdatasets share the same namespace):

<dsid>/<subdataset-relpath>

and then attach properties, such as:

{
  "@type": "schema:Dataset",
  "version": "<gitsha>",
  "url": "<clone-url>",
  "sameAs": "<dataset-id-that-could-be-gitsha-or-different>",
  "isVersionOf": "<dataset-concept-id>"
}
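A minimal sketch of how such a record could be assembled from the four submodule components listed above. The `Submodule` class, the function name, and all example values are hypothetical; including `name` in the record follows the "akin to schema:name" remark and is an assumption, since the example record above does not list it.

```python
from dataclasses import dataclass


@dataclass
class Submodule:
    """The four components of a submodule record, as listed above."""
    name: str
    path: str    # mountpoint relative to the superdataset root
    url: str
    commit: str  # gitsha of the registered subdataset version


def subdataset_record(super_dsid, sm, concept_id=None, same_as=None):
    """Describe the use of a dataset version as a subdataset."""
    record = {
        # The ID is scoped by the superdataset, not by the gitsha alone.
        "@id": f"{super_dsid}/{sm.path}",
        "@type": "schema:Dataset",
        "name": sm.name,
        "version": sm.commit,
        "url": sm.url,
    }
    if same_as is not None:
        record["sameAs"] = same_as
    if concept_id is not None:
        record["isVersionOf"] = concept_id
    return record


# Hypothetical values: the same dataset version used in two superdatasets.
sm = Submodule("inputs", "inputs/raw", "https://example.com/raw.git", "a1b2c3d")
rec_a = subdataset_record("super-a-uuid", sm, concept_id="raw-dataset-uuid")
rec_b = subdataset_record("super-b-uuid", sm, concept_id="raw-dataset-uuid")
```

Both records carry the same `version` and `isVersionOf`, yet their `@id` values differ, which is the property the proposal asks for.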


mih · 2023-07-27T10:58:59Z

This is very much in the context of describing datasets that are DataLad datasets. The main forum for that should be datalad/datalad-metalad#389

#101 also contains an example of an approach that does not require DataLad identifiers.

mih changed the title ~~Define @id for a File(Version)~~ Define @id for dataset content: File(Version) and Subdataset(Version) Jul 19, 2023

This was referenced Jul 19, 2023

Develop specification for file-level metadata #6

Closed

Provide guidelines on use of metadata record identifiers datalad/datalad-metalad#389

Open

mih mentioned this issue Jul 27, 2023

Documentation on describing datasets specifically #102

Open

8 tasks

mih closed this as completed Jul 27, 2023
