Support self-description of a dataset with tabby #55
Comments
It seems like the progression of ideas here is relevant: #23. Thinking in terms of what already exists in metalad, we have the With updates to make parts of the extractor code fit into the tabby command line api, to traverse dataset files (possibly using a
So the process of running a tabby "extractor" would be something like:
A first comment regarding:
IIUC that would mean that the Another point of view on the same topic: Tabby-records can describe datasets that are just a collection of files. This requires the addition of, for example, file-lists by some external entity (in the end through a manually triggered and parameterized process). If we use Tabby-records to also describe highly structured datasets, e.g. DataLad datasets, we can auto-generate much of the data that is manually generated in the first place. So that data should not be entered in the Tabby-record (because it is automatically created and updated). That means, depending on the nature of the dataset that we describe, i.e. file-collection vs. git-repo vs. DataLad dataset, the instructions for the use of tabby would differ. Or did I get this wrong?
Regarding id & version, we could reserve some words to mean "take from DataLad", and their presence would indicate that this is a "self" description. This is an additional burden on the processor of tabby records, but maybe easy?
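To make the reserved-word idea concrete, here is a minimal sketch. It assumes a hypothetical marker string `@datalad` (tabby defines no such word), a record held as a plain dict, and a caller that has already obtained the dataset id and version by some other means. The function names are illustrative only.

```python
# Hypothetical reserved word; this is an assumption for illustration,
# tabby does not define such a marker.
RESERVED = "@datalad"


def is_self_description(record):
    """A record counts as a self-description if any value is the marker."""
    return any(value == RESERVED for value in record.values())


def resolve_reserved(record, dataset_info):
    """Return a copy of `record` with reserved values filled in.

    `dataset_info` maps field names (e.g. 'id', 'version') to the values
    reported for the containing dataset. How these values are obtained
    is outside the scope of this sketch.
    """
    resolved = {}
    for key, value in record.items():
        if value == RESERVED:
            if key not in dataset_info:
                raise KeyError(f"no value available for reserved field {key!r}")
            resolved[key] = dataset_info[key]
        else:
            resolved[key] = value
    return resolved


record = {"name": "demo", "id": RESERVED, "version": RESERVED}
info = {"id": "example-uuid", "version": "example-sha"}
filled = resolve_reserved(record, info)
```

The static record never contains the id or version itself, so nothing needs updating when the dataset gets a new commit; the substitution happens each time the record is read.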
Follow up on my previous comment (#55 (comment)): I was under the impression, that tabby can and will be used for datasets of the categories: "directory and file collection", "git-repo", and "DataLad dataset". If that is the case, and we are not restricting it to use with "DataLad datasets", we should probably identify the non-extractable metadata for each scenario, i.e. manually added metadata that is not present in the dataset. We could then split the tabby-record into files that have to be created and files that will be auto-created, depending on the scenario. For example:
And additional burden on the curator. But one more important thing. If the version in form of a gitsha is part of tabby_dataset.tsv, then we have an infinite catch-up inconsistency: We have dataset with id and version . where is a commit of the dataset. If we now add (or edit) That is not a principal problem, just a weird inconsistency. That would be mitigated if the versions were recorded in an extra file, e.g. |
Best thing I can come up with is using the However, that does not work. It becomes obvious when thinking about the resulting graph structure. |
I don't think this is a good plan. Given that this is all linked data, it should not be an issue for any tooling to amend any record on a dataset with a version of that dataset. All it needs is knowledge on the identifier used for that dataset. The same goes for file records. As long as it is known how files are identified, any tooling (tabby, datalad, or not) can provide additional documents/records on file objects. There is no need to constrain formats and workflows, AFAICS. |
With #50 settled, we know that any tabby record will have a file name prefix. This implies that multiple tabby records per directory can always be represented in a non-conflicting fashion, given sensible prefixes. So in principle, supporting self-description means:
There is the possibility to use the outcome of #51 to determine the organization. However, that would be overkill. A dataset self-description would be subject to versioning itself, and using a version-dependent location for the metadata record is not sensible. A generic I propose to
In preparation for #79 and despite the conclusion in #50, this change adds support for a simplified set of files that form a tabby record. The only thing that is simplified is that the common prefix is removed from all filenames. The demo record is now also included in this format. This layout is what we would like to put into a ZIP file container. The prefix continues to exist (this was the main concern in #50), but is now the name of the parent directory. In #55 this simplifies the setup for the self-description of a dataset. All files could go into `.datalad/tabby/self/` and have short names like:
- `dataset.tsv`
- `dataset.override.json`
- ...
There is no particular additional markup necessary to distinguish the single-item-dir format from the prefixed layout. The absence of an underscore char is evidence enough. Closes #50 (for real)
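The underscore rule can be sketched as follows. This is not the implementation from the change itself, only an illustration under two assumptions: prefixed files are named `<prefix>_<sheet>.<ext>`, and sheet names never contain an underscore. The function name `split_record_filename` and the `tabbydemo` path are hypothetical.

```python
from pathlib import Path


def split_record_filename(path):
    """Return (record prefix, sheet name) for one file of a tabby record.

    Prefixed layout:        <prefix>_<sheet>.<ext>  -> prefix from the filename
    Single-item-dir layout: <dir>/<sheet>.<ext>     -> prefix is the directory name
    """
    p = Path(path)
    # Everything before the first dot, so that
    # 'dataset.override.json' yields 'dataset'.
    stem = p.name.split(".", 1)[0]
    if "_" in stem:
        # Assumption: the sheet name has no underscore, so split on the
        # last one and treat everything before it as the prefix.
        prefix, sheet = stem.rsplit("_", 1)
    else:
        # No underscore: the parent directory name acts as the prefix.
        prefix, sheet = p.parent.name, stem
    return prefix, sheet


self_layout = split_record_filename(".datalad/tabby/self/dataset.override.json")
prefixed_layout = split_record_filename("records/tabbydemo_dataset.tsv")
```

With these inputs, the first call identifies the record by its directory (`self`) and the second by its filename prefix (`tabbydemo`); both resolve to the `dataset` sheet.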
The format should be usable for providing metadata on a dataset in its DataLad dataset form. Practically, this means that we need to put TSV, json, jsonld files in some standard location, and we need to have a metadata extractor (or other utility) that can amend such a manual record with auto-generated information on files (and possibly other information).
Two consumption scenarios are of interest:
This is related to
However, the details are different. We need to be able to
self
, i.e. the DataLad dataset the tabby record is placed into. And we need to do that without requiring information duplication. This means NOT having the DataLad dataset UUID duplicated inside the (static) metadata record. Same for the gitsha of the last commit. All this needs to be filled in automatically by a metadata extractor on-read.
hasPart
relationships, but also need to be able to locate the part inside the parent DataLad dataset