
Support self-description of a dataset with tabby #55

Closed
mih opened this issue Jul 11, 2023 · 9 comments · Fixed by #84
Comments

@mih
Contributor

mih commented Jul 11, 2023

The format should be usable for providing metadata on a dataset in its DataLad dataset form. Practically, this means that we need to put TSV, json, jsonld files in some standard location, and we need to have a metadata extractor (or other utility) that can amend such a manual record with auto-generated information on files (and possibly other information).

Two consumption scenarios are of interest:

  • full dataset metadata record in JSON(-LD) format (for ingestion into DBs or other query operations)
  • an (auto-)completed record in tabby format (TSV, json, jsonld plain-text files) for deposition (alongside a tarball of files on a file server, for ingestion into a dataset catalog, etc)
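For the first scenario, a tabby TSV table ultimately needs to become a JSON(-LD) document. A minimal sketch of that step, assuming a simple one-key-per-row dataset table (the helper name and the exact row layout are illustrative, not part of tabby):

```python
import csv


def load_tabby_dataset(tsv_path):
    """Read a tabby 'dataset' TSV (one key per row, tab-separated values)
    into a plain dict, ready for JSON(-LD) serialization.

    Assumes the simple key/value(s) row layout; real tabby tables may be richer.
    """
    record = {}
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip empty and comment lines
            key, *values = row
            # single value -> scalar, multiple values -> list
            record[key] = values[0] if len(values) == 1 else values
    return record
```

From there, `json.dumps(record)` (or a JSON-LD framing step) yields the ingestable form.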

This is related to

However, the details are different. We need to be able to

  • Describe self, i.e. the DataLad dataset the tabby record is placed into. And we need to do that without requiring information duplication. This means NOT having the DataLad dataset UUID duplicated inside the (static) metadata record. Same for the gitsha of the last commit. All this needs to be filled in automatically by a metadata extractor on-read.
  • Describe subdatasets. We need not only to be able to express hasPart relationships, but also to be able to locate the part inside the parent DataLad dataset.
@jsheunis
Contributor

It seems like the progression of ideas here is relevant: #23

Thinking in terms of what already exists in metalad, we have the metalad_core extractor which can get all of these details of a dataset and its files when executed.

With updates to make parts of the extractor code fit the tabby command-line API, to traverse dataset files (possibly using a next iterator), and to output everything as a tabby record, what would be missing?

@jsheunis
Contributor

So the process of running a tabby "extractor" would be something like:

  • find tabby records in a default location and output them into json(-ld)
  • extract metadata from the DataLad dataset itself and its files and output that into json(-ld)
  • merge the two outputs into a single tabby record, and output
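The merge step in the process above could be as simple as letting freshly extracted, volatile values win over the static record (a sketch; the function and key names are hypothetical):

```python
def build_self_record(static_record, extracted):
    """Merge a manually curated tabby record with auto-extracted metadata.

    Values derived from the DataLad dataset itself (e.g. its UUID or the
    gitsha of the last commit) take precedence, so the static record never
    has to duplicate them, and stale duplicates cannot survive.
    """
    merged = dict(static_record)
    merged.update(extracted)
    return merged
```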

@christian-monch
Contributor

A first comment regarding:

  • Describe self, i.e. the DataLad dataset the tabby record is placed into. And we need to do that without requiring information duplication. This means NOT having the DataLad dataset UUID duplicated inside the (static) metadata record. Same for the gitsha of the last commit. All this needs to be filled in automatically by a metadata extractor on-read.

IIUC that would mean that the self-record has to be updated in a new version because the version number changes and, for example, because a subdataset or a file is added or removed. That, in turn, means we have some part of the self-record that is updated according to a dataset version and some part that is static, because it is manually added, for example, the "dataset description". Now it seems to me that the "static" part has to be provided to a process, i.e. some form of extraction, that adds automatically extracted "state" information from the dataset and emits the complete result. I can't shake the feeling that the "static" part is what was initially intended with tabby, and that we are mixing different concerns. At least, if the datasets are DataLad datasets, we only need to provide external metadata, like, for example, "dataset description", "authors", and "publications".

Another point of view on the same topic: tabby records can describe datasets that are just a collection of files. This requires the addition of, for example, file lists by some external entity (in the end through a manually triggered and parameterized process). If we use tabby records also to describe highly structured datasets, e.g. DataLad datasets, we can auto-generate much of the data that would otherwise have to be entered manually. So that data should not be entered in the tabby record (because it is automatically created and updated). That means that, depending on the nature of the dataset we describe, i.e. file collection vs. git repo vs. DataLad dataset, the instructions for the use of tabby would differ. Or did I get this wrong?

@mslw
Contributor

mslw commented Jul 11, 2023

Regarding id & version, we could reserve some words to mean "take from DataLad", and their presence would indicate that this is a "self-description" - additional burden on the processor of tabby records, but maybe easy?
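A sketch of that idea, with made-up placeholder tokens (no such reserved words exist in tabby today):

```python
def resolve_reserved(record, dataset_id, dataset_version):
    """Replace hypothetical reserved values with information 'taken from
    DataLad' at read time. Their presence would mark a self-description.
    """
    # Illustrative reserved tokens, not part of any tabby spec.
    substitutions = {
        "{datalad-dataset-id}": dataset_id,
        "{datalad-dataset-version}": dataset_version,
    }
    return {
        key: substitutions.get(value, value) if isinstance(value, str) else value
        for key, value in record.items()
    }
```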

@christian-monch
Contributor

christian-monch commented Jul 11, 2023

Follow-up on my previous comment (#55 (comment)):

I was under the impression that tabby can and will be used for datasets of the categories "directory and file collection", "git-repo", and "DataLad dataset".

If that is the case, and we are not restricting it to use with "DataLad datasets", we should probably identify the non-extractable metadata for each scenario, i.e. manually added metadata that is not present in the dataset. We could then split the tabby record into files that have to be created manually and files that will be auto-created, depending on the scenario. For example:

  • directory and file collection:

    • tabby_dataset.tsv # no version info
    • tabby_version.tsv # manually added version info
    • tabby_files.tsv
    • tabby_sub_dataset.tsv # not sure about this, just a placeholder
  • git-repo:

    • tabby_dataset.tsv
      (tabby_version.tsv, tabby_files.tsv, and tabby_subdataset.tsv are automatically created, if the desired process, e.g. serialization, requires them)
  • DataLad dataset:

    • tabby_dataset.tsv
      (tabby_version.tsv, tabby_files.tsv, and tabby_subdataset.tsv are automatically created, if the desired process, e.g. serialization, requires them)
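The split above could be captured in a small lookup (category labels follow the example; file names are the placeholders from the list, and "auto" files are only materialized when a process such as serialization needs them):

```python
# Manually authored vs. auto-creatable tabby files per dataset category,
# following the example split above (a sketch, not a spec).
MANUAL_FILES = {
    "file-collection": [
        "tabby_dataset.tsv",
        "tabby_version.tsv",     # manually added version info
        "tabby_files.tsv",
        "tabby_sub_dataset.tsv",  # placeholder, see above
    ],
    "git-repo": ["tabby_dataset.tsv"],
    "datalad-dataset": ["tabby_dataset.tsv"],
}

AUTO_FILES = {
    "file-collection": [],
    "git-repo": ["tabby_version.tsv", "tabby_files.tsv", "tabby_subdataset.tsv"],
    "datalad-dataset": ["tabby_version.tsv", "tabby_files.tsv", "tabby_subdataset.tsv"],
}


def files_to_author(category):
    """Which tabby files a curator would have to create by hand."""
    return MANUAL_FILES[category]
```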

@christian-monch
Contributor

christian-monch commented Jul 11, 2023

Regarding id & version, we could reserve some words to mean "take from DataLad", and their presence would indicate that this is a "self-description" - additional burden on the processor of tabby records, but maybe easy?

And additional burden on the curator.

But one more important thing. If the version in the form of a gitsha is part of tabby_dataset.tsv, then we have an infinite catch-up inconsistency:

We have a dataset with id `<id>` and version `<gitsha-1>`, where `<gitsha-1>` is a commit of the dataset. If we now add (or edit) tabby_dataset.tsv to contain version: `<gitsha-1>`, and commit this change, we get some version `<gitsha-2>`. That means the tabby_dataset.tsv in the dataset version `<gitsha-2>` will never identify the same version in its version-field. In this example the version-field will contain `<gitsha-1>`.

That is not a principal problem, just a weird inconsistency. That would be mitigated if the versions were recorded in an extra file, e.g. tabby_version.tsv, that is never created in git/datalad datasets. For git/datalad datasets, this file will only ever be created on serialization (and might not even be necessary).

@mih
Contributor Author

mih commented Jul 11, 2023

Describe subdatasets. We need not only to be able to express hasPart relationships, but also to be able to locate the part inside the parent DataLad dataset.

Best thing I can come up with is using the name property of the (dataset) entity linked via hasPart.

However, that does not work. It becomes obvious when thinking about the resulting graph structure. hasPart is the edge that links two nodes of type dataset. The subdataset (node) can take that role in any other context too, so it cannot host the property that identifies its location within the scope of a particular superdataset -- we need a dedicated type for that.
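The two graph shapes can be sketched as JSON-LD-style Python dicts (the `DatasetPart` type and all identifiers here are made up for illustration):

```python
# Problematic: the location ('name') sits on the subdataset node itself.
# That node may be linked via hasPart from many superdatasets, each of
# which would need a different location -- the property cannot live here.
problematic = {
    "@id": "ex:super-dataset",
    "hasPart": {"@id": "ex:sub-dataset", "name": "code/analysis"},
}

# Workable: an intermediate node of a dedicated (hypothetical) type
# carries the location, scoping it to this particular hasPart edge,
# while the subdataset node itself stays reusable in other contexts.
workable = {
    "@id": "ex:super-dataset",
    "hasPart": {
        "@type": "ex:DatasetPart",
        "name": "code/analysis",            # location within this superdataset
        "part": {"@id": "ex:sub-dataset"},  # the reusable dataset node
    },
}
```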

@mih
Contributor Author

mih commented Jul 17, 2023

If that is the case, and we are not restricting it to use with "DataLad datasets", we should probably identify the non-extractable metadata for each scenario, i.e. manually added metadata that is not present in the dataset.

I don't think this is a good plan. tabby in no way restricts what information goes into which table, or dictates the names and semantics of tables (beyond the convention that dataset is the starting point). I see no reason to change that with complicated/implicit semantics.

Given that this is all linked data, it should not be an issue for any tooling to amend any record on a dataset with a version of that dataset. All it needs is knowledge on the identifier used for that dataset.

The same goes for file records. As long as it is known how files are identified, any tooling (tabby, datalad, or not) can provide additional documents/records on file objects. There is no need to constrain formats and workflows, AFAICS.
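In linked-data terms, amending needs nothing but the identifier; a sketch (the property names are illustrative):

```python
def amend_records(records, dataset_id, version):
    """Attach a version statement to every record about the given dataset.

    Tooling only needs to know the identifier under which the dataset is
    described; the record formats and workflows stay unconstrained.
    """
    return [
        {**record, "version": version} if record.get("@id") == dataset_id else record
        for record in records
    ]
```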

@mih
Contributor Author

mih commented Jul 17, 2023

With #50 settled, we know that any tabby record will have a file name prefix. This implies that multiple tabby records per directory can always be represented in a non-conflicting fashion, given sensible prefixes.

So in principle, supporting self-description means:

  • declare a root directory for such records
  • declare a prefix scheme

There is the possibility to use the outcome of #51 to determine the organization. However, that would be overkill. A dataset self-description would be subject to versioning itself, and using a version-dependent location for the metadata record is not sensible. A generic version label, such as latest, may still yield a location that is hard to find (in a large record collection).

I propose to

  • declare .datalad/tabby as the root directory of any tabby record collection in a dataset
  • use the prefix self for a dataset self-description, leading to .datalad/tabby/self_dataset.tsv as the entrypoint of a self-description
  • declare .datalad/tabby/collection as the root directory for any additional tabby record organization (see Utility command to insert metadata records into a tree of versioned datasets #51), to keep it separate and easily distinguished from a self-description
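Under the proposed convention, a consumer could locate a self-description with a few lines (a sketch only; the function name is made up):

```python
from pathlib import Path


def find_self_description(dataset_root):
    """Return the entrypoint of a tabby self-description, or None.

    Follows the proposed convention: records live under .datalad/tabby,
    and the 'self' prefix marks the dataset's own description.
    """
    entrypoint = Path(dataset_root) / ".datalad" / "tabby" / "self_dataset.tsv"
    return entrypoint if entrypoint.is_file() else None
```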

mih added a commit that referenced this issue Jul 19, 2023
In preparation for
#79 and
despite the conclusion in
#50 this
change adds support for a simplified set of files that form a tabby
record.

The only thing that is simplified is that the common prefix is removed
from all filenames. The demo record is now also included in this format.

This layout is what we would like to put into a ZIP file container.

The prefix continues to exist (this was the main concern in #50), but is
now the name of the parent directory.

In #55
this simplifies the setup for the self-description of a dataset. All
files could go into `.datalad/tabby/self/` and have short names like:

- `dataset.tsv`
- `dataset.override.json`
- ...

There is no particular additional markup necessary to distinguish the
single-item-dir format from the prefixed layout. The absence of an
underscore char is evidence enough.

Closes #50 (for real)
@mih mih self-assigned this Jul 20, 2023
@mih mih closed this as completed in #84 Jul 21, 2023

4 participants