
Support self-description of a dataset with tabby #55

Closed
mih opened this issue Jul 11, 2023 · 9 comments · Fixed by #84
Comments

@mih
Contributor

mih commented Jul 11, 2023

The format should be usable for providing metadata on a dataset in its DataLad dataset form. Practically, this means that we need to put TSV, json, jsonld files in some standard location, and we need to have a metadata extractor (or other utility) that can amend such a manual record with auto-generated information on files (and possibly other information).

Two consumption scenarios are of interest:

  • full dataset metadata record in JSON(-LD) format (for ingestion into DBs or other query operations)
  • an (auto-)completed record in tabby format (TSV, json, jsonld plain-text files) for deposition (alongside a tarball of files on a file server, for ingestion into a dataset catalog, etc)
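For the first scenario, a tabby TSV table ultimately needs to become a JSON(-LD) document. A minimal sketch of that step, assuming a simple one-key-per-row dataset table (the helper name and the exact row layout are illustrative, not part of tabby):

```python
import csv


def load_tabby_dataset(tsv_path):
    """Read a tabby 'dataset' TSV (one key per row, tab-separated values)
    into a plain dict, ready for JSON(-LD) serialization.

    Assumes the simple key/value(s) row layout; real tabby tables may be richer.
    """
    record = {}
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip empty and comment lines
            key, *values = row
            # single value -> scalar, multiple values -> list
            record[key] = values[0] if len(values) == 1 else values
    return record
```

From there, `json.dumps(record)` (or a JSON-LD framing step) yields the ingestable form.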

This is related to

However, the details are different. We need to be able to

  • Describe self, i.e. the DataLad dataset the tabby record is placed into. And we need to do that without requiring information duplication. This means NOT having the DataLad dataset UUID duplicated inside the (static) metadata record. Same for the gitsha of the last commit. All this needs to be filled in automatically by a metadata extractor on-read.
  • Describe subdatasets. We need not only to be able to express hasPart relationships, but also to be able to locate the part inside the parent DataLad dataset.
@jsheunis
Contributor

It seems like the progression of ideas here is relevant: #23

Thinking in terms of what already exists in metalad, we have the metalad_core extractor which can get all of these details of a dataset and its files when executed.

With updates to make parts of the extractor code fit the tabby command-line API, to traverse dataset files (possibly using a next iterator), and to output everything as a tabby record, what would be missing?

@jsheunis
Contributor

So the process of running a tabby "extractor" would be something like:

  • find tabby records in a default location and output them into json(-ld)
  • extract metadata from the DataLad dataset itself and its files and output that into json(-ld)
  • merge the two outputs into a single tabby record, and output
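The merge step in the process above could be as simple as letting freshly extracted, volatile values win over the static record (a sketch; the function and key names are hypothetical):

```python
def build_self_record(static_record, extracted):
    """Merge a manually curated tabby record with auto-extracted metadata.

    Values derived from the DataLad dataset itself (e.g. its UUID or the
    gitsha of the last commit) take precedence, so the static record never
    has to duplicate them, and stale duplicates cannot survive.
    """
    merged = dict(static_record)
    merged.update(extracted)
    return merged
```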

@christian-monch
Contributor

A first comment regarding:

  • Describe self, i.e. the DataLad dataset the tabby record is placed into. And we need to do that without requiring information duplication. This means NOT having the DataLad dataset UUID duplicated inside the (static) metadata record. Same for the gitsha of the last commit. All this needs to be filled in automatically by a metadata extractor on-read.

IIUC that would mean that the self-record has to be updated in a new version because the version number changes and, for example, because a subdataset or a file is added or removed. That, in turn, means we have some part of the self-record that is updated according to a dataset version and some part that is static, because it is manually added, for example, the "dataset description". Now it seems to me that the "static" part has to be provided to a process, i.e. some form of extraction, that adds automatically extracted "state" information from the dataset and emits the complete result. I can't shake the feeling that the "static" part is what was initially intended with tabby, and that we are mixing different concerns. At least, if the datasets are DataLad datasets, we only need to provide external metadata, like, for example, "dataset description", "authors", and "publications".

Another point of view on the same topic: tabby records can describe datasets that are just a collection of files. This requires the addition of, for example, file lists by some external entity (in the end through a manually triggered and parameterized process). If we use tabby records also to describe highly structured datasets, e.g. DataLad datasets, we can auto-generate much of the data that would otherwise have to be entered manually. So that data should not be entered in the tabby record (because it is automatically created and updated). That means that, depending on the nature of the dataset we describe, i.e. file collection vs. git repo vs. DataLad dataset, the instructions for the use of tabby would differ. Or did I get this wrong?

@mslw
Contributor

mslw commented Jul 11, 2023

Regarding id & version, we could reserve some words to mean "take from DataLad", and their presence would indicate that this is a "self-description" - additional burden on the processor of tabby records, but maybe easy?
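A sketch of that idea, with made-up placeholder tokens (no such reserved words exist in tabby today):

```python
def resolve_reserved(record, dataset_id, dataset_version):
    """Replace hypothetical reserved values with information 'taken from
    DataLad' at read time. Their presence would mark a self-description.
    """
    # Illustrative reserved tokens, not part of any tabby spec.
    substitutions = {
        "{datalad-dataset-id}": dataset_id,
        "{datalad-dataset-version}": dataset_version,
    }
    return {
        key: substitutions.get(value, value) if isinstance(value, str) else value
        for key, value in record.items()
    }
```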

@christian-monch
Contributor

christian-monch commented Jul 11, 2023

Follow-up on my previous comment (#55 (comment)):

I was under the impression that tabby can and will be used for datasets of the categories "directory and file collection", "git-repo", and "DataLad dataset".

If that is the case, and we are not restricting it to use with "DataLad datasets", we should probably identify the non-extractable metadata for each scenario, i.e. manually added metadata that is not present in the dataset. We could then split the tabby record into files that have to be created manually and files that will be auto-created, depending on the scenario. For example:

  • directory and file collection:

    • tabby_dataset.tsv # no version info
    • tabby_version.tsv # manually added version info
    • tabby_files.tsv
    • tabby_sub_dataset.tsv # not sure about this, just a placeholder
  • git-repo:

    • tabby_dataset.tsv
      (tabby_version.tsv, tabby_files.tsv, and tabby_subdataset.tsv are automatically created, if the desired process, e.g. serialization, requires them)
  • DataLad dataset:

    • tabby_dataset.tsv
      (tabby_version.tsv, tabby_files.tsv, and tabby_subdataset.tsv are automatically created, if the desired process, e.g. serialization, requires them)
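The split above could be captured in a small lookup (category labels follow the example; file names are the placeholders from the list, and "auto" files are only materialized when a process such as serialization needs them):

```python
# Manually authored vs. auto-creatable tabby files per dataset category,
# following the example split above (a sketch, not a spec).
MANUAL_FILES = {
    "file-collection": [
        "tabby_dataset.tsv",
        "tabby_version.tsv",     # manually added version info
        "tabby_files.tsv",
        "tabby_sub_dataset.tsv",  # placeholder, see above
    ],
    "git-repo": ["tabby_dataset.tsv"],
    "datalad-dataset": ["tabby_dataset.tsv"],
}

AUTO_FILES = {
    "file-collection": [],
    "git-repo": ["tabby_version.tsv", "tabby_files.tsv", "tabby_subdataset.tsv"],
    "datalad-dataset": ["tabby_version.tsv", "tabby_files.tsv", "tabby_subdataset.tsv"],
}


def files_to_author(category):
    """Which tabby files a curator would have to create by hand."""
    return MANUAL_FILES[category]
```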

@christian-monch
Contributor

christian-monch commented Jul 11, 2023

Regarding id & version, we could reserve some words to mean "take from DataLad", and their presence would indicate that this is a "self-description" - additional burden on the processor of tabby records, but maybe easy?

And additional burden on the curator.

But one more important thing. If the version in the form of a gitsha is part of tabby_dataset.tsv, then we have an infinite catch-up inconsistency:

We have a dataset with id `<id>` and version `<gitsha-1>`, where `<gitsha-1>` is a commit of the dataset. If we now add (or edit) tabby_dataset.tsv to contain version: `<gitsha-1>`, and commit this change, we get some version `<gitsha-2>`. That means the tabby_dataset.tsv in the dataset version `<gitsha-2>` will never identify the same version in its version-field. In this example the version-field will contain `<gitsha-1>`.

That is not a principal problem, just a weird inconsistency. That would be mitigated if the versions were recorded in an extra file, e.g. tabby_version.tsv, that is never created in git/datalad datasets. For git/datalad datasets, this file will only ever be created on serialization (and might not even be necessary).

@mih
Contributor Author

mih commented Jul 11, 2023

Describe subdatasets. We need not only to be able to express hasPart relationships, but also to be able to locate the part inside the parent DataLad dataset.

Best thing I can come up with is using the name property of the (dataset) entity linked via hasPart.

However, that does not work. It becomes obvious when thinking about the resulting graph structure. hasPart is the edge that links two nodes of type dataset. The subdataset (node) can take that role in any other context too, so it cannot host the property that identifies its location within the scope of a particular superdataset -- we need a dedicated type for that.
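The two graph shapes can be sketched as JSON-LD-style Python dicts (the `DatasetPart` type and all identifiers here are made up for illustration):

```python
# Problematic: the location ('name') sits on the subdataset node itself.
# That node may be linked via hasPart from many superdatasets, each of
# which would need a different location -- the property cannot live here.
problematic = {
    "@id": "ex:super-dataset",
    "hasPart": {"@id": "ex:sub-dataset", "name": "code/analysis"},
}

# Workable: an intermediate node of a dedicated (hypothetical) type
# carries the location, scoping it to this particular hasPart edge,
# while the subdataset node itself stays reusable in other contexts.
workable = {
    "@id": "ex:super-dataset",
    "hasPart": {
        "@type": "ex:DatasetPart",
        "name": "code/analysis",            # location within this superdataset
        "part": {"@id": "ex:sub-dataset"},  # the reusable dataset node
    },
}
```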

@mih
Contributor Author

mih commented Jul 17, 2023

If that is the case, and we are not restricting it to use with "DataLad datasets", we should probably identify the non-extractable metadata for each scenario, i.e. manually added metadata that is not present in the dataset.

I don't think this is a good plan. tabby in no way restricts what information goes into which table, or dictates the names and semantics of tables (beyond the convention that dataset is the starting point). I see no reason to change that with complicated/implicit semantics.

Given that this is all linked data, it should not be an issue for any tooling to amend any record on a dataset with a version of that dataset. All it needs is knowledge on the identifier used for that dataset.

The same goes for file records. As long as it is known how files are identified, any tooling (tabby, datalad, or not) can provide additional documents/records on file objects. There is no need to constrain formats and workflows, AFAICS.
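In linked-data terms, amending needs nothing but the identifier; a sketch (the property names are illustrative):

```python
def amend_records(records, dataset_id, version):
    """Attach a version statement to every record about the given dataset.

    Tooling only needs to know the identifier under which the dataset is
    described; the record formats and workflows stay unconstrained.
    """
    return [
        {**record, "version": version} if record.get("@id") == dataset_id else record
        for record in records
    ]
```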

@mih
Contributor Author

mih commented Jul 17, 2023

With #50 settled, we know that any tabby record will have a file name prefix. This implies that multiple tabby records per directory can always be represented in a non-conflicting fashion, given sensible prefixes.

So in principle, supporting self-description means:

  • declare a root directory for such records
  • declare a prefix scheme

There is the possibility to use the outcome of #51 to determine the organization. However, that would be overkill. A dataset self-description would be subject to versioning itself, and using a version-dependent location for the metadata record is not sensible. A generic version label, such as latest, may still yield a location that is hard to find (in a large record collection).

I propose to

  • declare .datalad/tabby as the root directory of any tabby record collection in a dataset
  • use the prefix self for a dataset self-description, leading to .datalad/tabby/self_dataset.tsv as the entrypoint of a self-description
  • declare .datalad/tabby/collection as the root directory for any additional tabby record organization (see Utility command to insert metadata records into a tree of versioned datasets #51), to keep it separate and easily distinguished from a self-description
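Under the proposed convention, a consumer could locate a self-description with a few lines (a sketch only; the function name is made up):

```python
from pathlib import Path


def find_self_description(dataset_root):
    """Return the entrypoint of a tabby self-description, or None.

    Follows the proposed convention: records live under .datalad/tabby,
    and the 'self' prefix marks the dataset's own description.
    """
    entrypoint = Path(dataset_root) / ".datalad" / "tabby" / "self_dataset.tsv"
    return entrypoint if entrypoint.is_file() else None
```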

mih added a commit that referenced this issue Jul 19, 2023
In preparation for
#79 and
despite the conclusion in
#50 this
change adds support for a simplified set of files that form a tabby
record.

The only thing that is simplified is that the common prefix is removed
from all filenames. The demo record is now also included in this format.

This layout is what we would like to put into a ZIP file container.

The prefix continues to exist (this was the main concern in #50), but is
now the name of the parent directory.

In #55
this simplifies the setup for the self-description of a dataset. All
files could go into `.datalad/tabby/self/` and have short names like:

- `dataset.tsv`
- `dataset.override.json`
- ...

There is no particular additional markup necessary to distinguish the
single-item-dir format from the prefixed layout. The absence of an
underscore char is evidence enough.

Closes #50 (for real)
@mih mih self-assigned this Jul 20, 2023
@mih mih closed this as completed in #84 Jul 21, 2023

4 participants