Define @id for a Dataset(Version) #76

Closed
Tracked by #102
mih opened this issue Jul 19, 2023 · 5 comments

@mih
Contributor

mih commented Jul 19, 2023

Replaces: datalad/datalad-metalad#380
Ping: datalad/datalad-registry#217

A tabby record will typically be a version-level description (in HCLS terms). However, this is not necessarily the case: without a version label, we would be missing an essential component, and it would instantly be a summary-level description.

Such a difference would not necessarily impact the type annotation. Both could be dcat:Dataset or https://schema.org/Dataset. It would, however, matter for crafting a valid @id.

We need to have a common approach for @id choice within datalad's metadata ecosystem to simplify homogenization and merges across metadata sources (see datalad/datalad-metalad#30 for other thoughts). Any approach to @id must not confuse the different description levels.

I posted some ideas in datalad/datalad-registry#217 (comment)

Concrete issues:

  • a metadata record may have one or more dataset DOIs on record. This could serve as @id. However, in general we will not be able to infer the nature of such a DOI (concept DOI covering all versions vs. version-specific DOI). Moreover, such a DOI may be specific to a particular download (distribution-level identifier). One and the same dataset (version) could be hosted in more than one data portal and receive different DOIs that all point to the exact same information at different locations.
  • a tabby metadata extractor would need to report at least two metadata records: the version-level description, and a concept-level description (the former linking to the latter via isVersionOf)
@christian-monch
Contributor

One observation with regard to the HCLS Version Level Description (HCLS VLD), which I came across when working on the JSON-LD structure for the metalad_core-dataset extractor:

I am not sure that the HCLS VLD maps well onto commit-based versioning. The HCLS VLD concept seems to describe a selected state that was released. It seems more related to tags in datasets.

I just wanted to mention this; I am not proposing any change, because I think it can be justified to apply HCLS VLD to commit-based versions (although the cardinality-1 pav:previousVersion does not always make sense there). It would also definitely be easier for us, because a dataset version-level id could then just be something like:

https://dx.datalag.org/dataset/<uuid>@<commit-sha>

@mih
Contributor Author

mih commented Jul 19, 2023

Can you give a concrete example where it would not map well?

pav:previousVersion with gitshas as version identifiers would work nicely, IMHO. There is no way to alter the history without also (automatically) altering the identifiers, i.e. given any commit-version, the previous commit is always fixed/known. I do not understand when it "does not always make sense there".

https://dx.datalag.org/dataset/<uuid>@<commit-sha> is proposed in datalad/datalad-registry#217. However, I cannot convince myself that it is actually any better than https://dx.datalag.org/dataset/<commit-sha> (but certainly much longer). Do you have an argument in favor of the former over the latter?

@christian-monch
Contributor

christian-monch commented Jul 19, 2023

> Can you give a concrete example where it would not map well?

I thought about merge-commits. But that might not be relevant, because we can choose one of the parents.
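A minimal sketch of the "choose one of the parents" idea, assuming the first parent is the one picked (that convention is an assumption, not something decided in this thread). The helper names are hypothetical; the input is one line in the format printed by `git rev-list --parents -n 1 <commit>` (the commit followed by its parents), and the property is spelled `pav:previousVersion` as in the PAV ontology.

```python
from typing import List, Optional

# Base URL as used in the examples in this thread.
BASE = "https://dx.datalag.org/dataset/"


def parents_from_rev_list(line: str) -> List[str]:
    """Return the parent SHAs from one `git rev-list --parents` output line.

    The line holds the commit itself first, then zero or more parents.
    """
    fields = line.split()
    return fields[1:]


def previous_version(line: str, base: str = BASE) -> Optional[dict]:
    """Build a cardinality-1 pav:previousVersion link for a commit.

    For merge commits the first parent is chosen (assumed convention).
    A root commit has no parent, so no link is produced.
    """
    parents = parents_from_rev_list(line)
    if not parents:
        return None
    return {"pav:previousVersion": {"@id": base + parents[0]}}
```

With this convention a merge commit still yields exactly one previous version, at the cost of not recording the other parent(s).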

> https://dx.datalag.org/dataset/<uuid>@<commit-sha> is proposed in datalad/datalad-registry#217. However, I cannot convince myself that it is actually any better than https://dx.datalag.org/dataset/<commit-sha> (but certainly much longer). Do you have an argument in favor of the former over the latter?

I would actually also propose the latter, i.e. https://dx.datalag.org/dataset/<commit-sha>. That is similar to what is used in the studyminimeta-extractor output (https://schema.datalad.org/datalad_dataset#<commit-sha>).

@mih
Contributor Author

mih commented Jul 20, 2023

re https://dx.datalag.org/dataset/<commit-sha> vs https://dx.datalag.org/dataset/<uuid>@<commit-sha>

The strongest argument that I can find for going for gitsha-only is:

  • any extractor executed on any real-world "dataset" will always have a gitsha to work with
  • a DataLad UUID is standard in DataLad datasets, but not universally guaranteed
  • if we make a "concept" UUID a requirement for a version-level ID, we either exclude any plain Git(Annex)Repo, or we require a standard mechanism to generate a UUID

Not having a UUID component in the version-level ID avoids this complication, with no loss of functionality or precision.
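As an illustration of the gitsha-only variant, here is a small sketch that builds a version-level @id from nothing but a commit SHA. The function name and the validation rule (full 40-character SHA-1 or 64-character SHA-256 hex digest) are assumptions made for this example; the base URL is the one used in the examples in this thread.

```python
import re

# Base URL as used in the examples in this thread.
BASE = "https://dx.datalag.org/dataset/"

# Assumption: only full-length hex digests are accepted
# (40 characters for SHA-1, 64 for SHA-256 repositories).
_SHA_RE = re.compile(r"[0-9a-f]{40}|[0-9a-f]{64}")


def version_level_id(commit_sha: str, base: str = BASE) -> str:
    """Return a version-level @id that depends on the commit SHA only."""
    sha = commit_sha.strip().lower()
    if not _SHA_RE.fullmatch(sha):
        raise ValueError(f"not a full commit SHA: {commit_sha!r}")
    return base + sha
```

No UUID is needed as input, so the same function works for a DataLad dataset and for a plain Git or git-annex repository.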

mih added a commit that referenced this issue Jul 20, 2023
This establishes the absolute minimum necessary to distinguish
summary-level and version-level dataset descriptors.

The main metadata record is always considered to be a version-level
description. In addition, a summary-level description is linked
via `dcterms:isVersionOf`. The summary-level descriptor itself
is bare-bones and only declares `hasVersion` with a backlink to the
version-level description.

Ping #76
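The two-record structure described in this commit message could look roughly like the following sketch. It is not the code from the referenced commit: the function name, the `dcterms:` prefix on `hasVersion`, the shape of the `@context`, and the way the summary-level @id is supplied are all assumptions made for illustration.

```python
import json
from typing import List, Optional

# DCMI Metadata Terms namespace.
DCTERMS = "http://purl.org/dc/terms/"


def build_records(
    version_id: str,
    summary_id: str,
    properties: Optional[dict] = None,
) -> List[dict]:
    """Return [version-level record, summary-level record] as JSON-LD dicts."""
    context = {"dcterms": DCTERMS}
    # The main record is the version-level description; it carries the
    # actual metadata and links to the summary level.
    version_record = {
        "@context": context,
        "@id": version_id,
        "@type": "https://schema.org/Dataset",
        **(properties or {}),
        "dcterms:isVersionOf": {"@id": summary_id},
    }
    # The summary-level descriptor is bare-bones: only a backlink.
    summary_record = {
        "@context": context,
        "@id": summary_id,
        "dcterms:hasVersion": {"@id": version_id},
    }
    return [version_record, summary_record]


if __name__ == "__main__":
    records = build_records(
        "https://dx.datalag.org/dataset/<commit-sha>",  # placeholder @id
        "urn:example:dataset-concept",  # hypothetical summary-level @id
        {"name": "demo"},
    )
    print(json.dumps(records, indent=2))
```

How the summary-level @id is chosen is left open here, which matches the open question in this issue.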
@mih
Contributor Author

mih commented Jul 27, 2023

This is very much in the context of describing datasets that are DataLad datasets. The main forum for that should be datalad/datalad-metalad#389

#101 also contains an example of an approach that does not require DataLad identifiers.

@mih mih closed this as completed Jul 27, 2023