Define @id for dataset content: File(Version) and Subdataset(Version) #78

Closed
Tracked by #102
mih opened this issue Jul 19, 2023 · 1 comment

mih commented Jul 19, 2023

Ping:

This is the content-focused companion of #76. We need concept-level and version-level IDs (and possibly even distribution-level IDs) for files and subdatasets too.

In the following <dsid> would be a DataLad dataset UUID that represents the concept-level identifier of a dataset.

A file can be:

  • <dsid>/README: "the README of this dataset", i.e. the concept of a file with this name within the scope of that particular dataset
  • <checksum/annexid>: a particular content blob, which could be declared isVersionOf a "file concept" within a particular dataset

A distribution-level file description could be

  • the association of a location/download-method with a content blob -- however, this could also just be a plain property of that blob's metadata record
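The three levels above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the "sha256:" prefix for the blob ID, and the use of "contentUrl" for the download location are hypothetical choices, not something this issue settles.

```python
import hashlib
from typing import Optional


def file_concept_id(dsid: str, relpath: str) -> str:
    """Concept-level ID: 'the file at this path within this dataset'."""
    return f"{dsid}/{relpath}"


def file_version_record(
    dsid: str, relpath: str, content: bytes, url: Optional[str] = None
) -> dict:
    """Version-level record for one content blob.

    The '@id' format (a prefixed SHA-256 hex digest) is an assumption;
    an annex key could be used instead.
    """
    record = {
        "@id": "sha256:" + hashlib.sha256(content).hexdigest(),
        "isVersionOf": file_concept_id(dsid, relpath),
    }
    if url is not None:
        # Distribution-level information kept as a plain property of the
        # blob's record (the property name is a hypothetical choice).
        record["contentUrl"] = url
    return record
```

Two different contents of the same README would get distinct "@id" values but share the same "isVersionOf" target.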

For subdatasets the situation appears to be slightly more complex. The technical vehicle of a submodule is composed of:

  • name: akin to schema:name
  • path: the mountpoint, and in some sense an ID component identifying a dataset as a subdataset within the scope of a superdataset
  • url: akin to schema:url
  • commit: akin to schema:version

So we always have a version-level description here. But the described version cannot be a sufficient identifier, because we need to describe the use of that dataset version as a subdataset (the same version can be used in many different superdatasets, with unique path values and possibly other properties).

At the very minimum, we need to reflect this with a unique @id that cannot just be a gitsha (or <dsid>/<gitsha>, see datalad/datalad-registry#217 (comment)).

We could use the same ID format as for files (after all, files in a dataset and subdatasets share the same namespace):

<dsid>/<subdataset-relpath>

and then attach properties, such as:

{
  "@type": "schema:Dataset",
  "version": "<gitsha>",
  "url": "<clone-url>",
  "sameAs": "<dataset-id-that-could-be-gitsha-or-different>",
  "isVersionOf": "<dataset-concept-id>"
}
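As a rough sketch of how such a record could be assembled from the four submodule components, assuming a hypothetical Submodule container and subdataset_record helper (neither is an existing DataLad API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Submodule:
    """The four components of a submodule listed above (hypothetical container)."""
    name: str
    path: str
    url: str
    commit: str


def subdataset_record(
    super_dsid: str,
    sm: Submodule,
    same_as: Optional[str] = None,
    concept_id: Optional[str] = None,
) -> dict:
    """Describe the use of a dataset version as a subdataset of `super_dsid`."""
    record = {
        # Scoped to the superdataset, so the same commit used elsewhere
        # (or at another path) gets a different @id.
        "@id": f"{super_dsid}/{sm.path}",
        "@type": "schema:Dataset",
        "name": sm.name,
        "version": sm.commit,
        "url": sm.url,
    }
    if same_as is not None:
        record["sameAs"] = same_as
    if concept_id is not None:
        record["isVersionOf"] = concept_id
    return record
```

Calling this for the same Submodule with two different superdataset IDs yields records with identical "version" values but distinct "@id" values, which is the property asked for above.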
mih changed the title from "Define @id for a File(Version)" to "Define @id for dataset content: File(Version) and Subdataset(Version)" Jul 19, 2023
mih commented Jul 27, 2023

This is very much in the context of describing datasets that are DataLad datasets. The main forum for that should be datalad/datalad-metalad#389

#101 also contains an example of an approach that does not require DataLad identifiers.

mih closed this as completed Jul 27, 2023