This repository has been archived by the owner on May 15, 2024. It is now read-only.

Usage:format: per-usage format spec #54

Closed
wants to merge 3 commits into from

Collaborator

@PeterKraus commented Mar 12, 2024

Sparked by marda-alliance/metadata_extractors_registry#78 (comment)

It might be useful to have a mechanism to indicate which package/library needs to be present in the "caller" environment to understand the format of the objects returned in memory.

Currently, we only have an install target of [formats] in the API, which installs xarray and pandas into the parent environment. However, if the required library is not present in the parent environment, the unpickling of the shared memory object will fail. We can annotate what's required (should be a single library per usage, in my opinion) here, and then modify the API to use this data.

See the Extractor-datatree.yaml example file to see what I mean in more detail.
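
To illustrate how the API could use such an annotation, here is a minimal sketch of a pre-flight check in the parent environment. The function name `missing_requirements` and the `usage_requires` mapping are hypothetical and not part of the schema or the API; the only real call is `importlib.util.find_spec` from the standard library.

```python
from importlib.util import find_spec


def missing_requirements(required):
    """Return the top-level package names in `required` that cannot be imported here."""
    return [name for name in required if find_spec(name) is None]


# Hypothetical per-usage annotation (assumption): a single library per usage.
usage_requires = {"read-datatree": ["xarray"]}

for usage, packages in usage_requires.items():
    missing = missing_requirements(packages)
    if missing:
        print(f"Usage {usage!r} needs {missing} in the calling environment.")
```

The check only looks at top-level package names, so it says nothing about versions or conflicting requirements.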

Member

@ml-evs ml-evs left a comment

I think this one needs a bit more discussion tbh. For this field to be truly machine-actionable it somehow also needs to have the same granularity as the install instructions (in my mind), which also feels like overkill... I get the idea that each extractor command may have a different "format", but I was imagining something more like an additional config in the install metadata that specifies any arbitrary output_requirements or somesuch. This is all pretty awkward anyway, as the point of the API is that each extractor is isolated, so if we're allowing extractors to mark that they need arbitrary (/conflicting) Python reqs installed in the top-level executing environment then we're kinda scuppered.

This leaves us with the current option, that there is a set of "blessed" packages, e.g., pandas, xarray, (perhaps nexus) and then of course any generic Python objects, that are "supported" in this mode, in which case this field does not need to be machine-actionable but can provide useful info to a user not using the reference implementation.

What do you think?

@PeterKraus
Collaborator Author

I expected this to need some thought.

I don't want to implement another package manager (hence the npm story I shared with you). The idea for this one was really to provide a way to:

  • add some metadata on what's required downstream of the tractor to "understand" what's coming in memory from upstream
  • search the yard, for instance if galvani needs just pandas but yadg needs xarray ...
  • be able to provide a better hint in tractor beam about what's gone wrong when the unpickling fails

I am also not 100% happy with calling it "format" and the description could be improved to clarify the above. Would you be OK with that?
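
As a rough sketch of the third point, the unpickling step could catch the import failure and re-raise it with a hint. `load_with_hint` is a hypothetical helper, not the actual API code, and it assumes the shared object arrives as pickled bytes.

```python
import pickle


def load_with_hint(payload: bytes):
    """Unpickle `payload`, turning a missing-package failure into a clearer error."""
    try:
        return pickle.loads(payload)
    except ModuleNotFoundError as exc:
        # exc.name holds the module that could not be imported while unpickling.
        raise RuntimeError(
            f"Could not unpickle the extractor output: package {exc.name!r} "
            "is not installed in the calling environment."
        ) from exc
```

With a per-usage annotation in place, the message could name the annotated requirement instead of whatever module the import happened to fail on.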

@PeterKraus
Collaborator Author

At today's meeting, we decided that it's reasonable to expect the in-memory returned object to be either a native Python object, or something that is understood by pandas or xarray (i.e. the current content of [formats], which ought to be installed by default, see marda-alliance/metadata_extractors_api#33).
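
For illustration only, a caller could approximate that rule by looking at the top-level module of the returned object's type. The names `BLESSED_PACKAGES` and `is_supported_output` are hypothetical, and treating "native Python object" as "a `builtins` type" is an assumption of this sketch: standard-library types outside `builtins` are rejected, and containers are not inspected recursively.

```python
# Assumption: "native" means a builtins type; pandas and xarray are the [formats] packages.
BLESSED_PACKAGES = {"builtins", "pandas", "xarray"}


def is_supported_output(obj) -> bool:
    """Return True if the type of `obj` is defined in builtins, pandas or xarray."""
    top_level = type(obj).__module__.split(".")[0]
    return top_level in BLESSED_PACKAGES
```

The check does not import pandas or xarray itself, so it also works in an environment where [formats] is not installed.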

We hope to implement a proper output spec once more packages are in the registry. Closing.

@PeterKraus PeterKraus closed this Apr 9, 2024
2 participants