get_file_list #15

Open
gwbischof opened this issue May 26, 2020 · 4 comments

Comments

@gwbischof
Contributor

gwbischof commented May 26, 2020

A talk with @danielballan led us to these questions.

Can the handler methods get_file_list be part of the handler class or be moved to event_model?

It appears that none of the handlers in handlers.py need to open the file to get the list of files for the run. And the file list can be determined from the mongo documents.

If get_file_list is very specific to the handler, maybe we leave this method as part of the Handler class, but change it so that it can be called without calling __init__ first.

Some of the handler inits call open on the file, and I would like to get the list of files without access to the file.

Thoughts?

@danielballan
Member

> the file list can be determined from the mongo documents.

This may be true now but I think it's important that we leave open the possibility that we may encounter formats where it won't be true. We emit Resource documents early in the acquisition process, and it's possible to imagine situations where we might not know which / how many files a certain detector will create, and where we would need access to the files themselves to work that out.

> If get_file_list is very specific to the handler, maybe we leave this method as part of the Handler class, but change it so that it can be called, without calling the __init__ first.

Right. To iterate on that idea a bit, we could add an optional classmethod to the API which returns the file list. In situations where that's not possible without opening the file, we would not implement the method.
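As a rough illustration of the optional-classmethod idea, here is a minimal sketch. Everything in it is hypothetical: `TemplateHandler`, `OpaqueHandler`, `try_file_list`, and the `file_list` signature are made-up names for this example, not part of any existing handler API.

```python
class TemplateHandler:
    """Hypothetical handler whose file list follows from the documents alone."""

    def __init__(self, resource_path, **resource_kwargs):
        # A real handler might open the file here; this sketch does not.
        self._path = resource_path

    @classmethod
    def file_list(cls, resource_path, resource_kwargs, datum_kwargs_list):
        # No instance and no file access needed: one resource, one file.
        return [resource_path]


class OpaqueHandler:
    """Hypothetical handler that would have to read the file to list its files.

    It deliberately does not implement the optional ``file_list`` classmethod.
    """

    def __init__(self, resource_path, **resource_kwargs):
        self._path = resource_path


def try_file_list(handler_class, resource_path, resource_kwargs, datum_kwargs_list):
    """Return the file list if the handler class offers one, else None."""
    method = getattr(handler_class, "file_list", None)
    if method is None:
        return None  # caller must fall back to instantiating the handler
    return method(resource_path, resource_kwargs, datum_kwargs_list)
```

A caller that gets `None` back knows it has to fall back to constructing the handler, which keeps the classmethod strictly optional.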

@stuartcampbell
Member

Do we want to think about what might come in the future? Can we foresee a time when the 'resource' list might not be a list of files, but a list of other URIs (or something else)?

@danielballan
Member

A related point to @stuartcampbell's: we want to keep an eye on making handlers simpler, perhaps reducing them to something more like a Reader https://github.com/danielballan/pims2-prototype that supports the built-in open API, rather than making them ever more databroker-specific. Related to bluesky/event-model#156

@tacaswell
Contributor

No, the exact list of files is something that can only be determined by the handlers, because we need to support both root_map and file formats that may produce an arbitrary number of files.

We need to be able, at access time, to re-map the "root" from where it was at collection time. Even if we had a unified file system across the facility, we would not be able to extend that across the street or to users' home institutions. I think the tooling built in databroker-pack, which automates (re)writing root maps, validates this design.

To simplify the handlers we chose to do the root re-mapping in filestore/databroker/filler code and then pass the handler a string that is an absolute path. However, this means that the only place that we can have something that looks like a file path is the path-related entries in the Resource document and the resource_path, as the root map is not passed into the handler.
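To make the re-mapping step concrete, here is a minimal sketch of what "map the root, then hand the handler an absolute path" could look like. It is not the actual filestore/databroker code: `mapped_resource_path` and the example paths are hypothetical, and the only thing assumed about the Resource document is that it carries `root` and `resource_path` keys.

```python
import posixpath


def mapped_resource_path(resource, root_map):
    """Build the absolute path a handler would receive.

    ``root_map`` maps a root recorded at collection time to the place
    where the same tree is mounted at access time. Roots that are not
    in the map are used unchanged.
    """
    root = root_map.get(resource["root"], resource["root"])
    return posixpath.join(root, resource["resource_path"])


# Hypothetical Resource document and root map, for illustration only.
resource = {
    "spec": "AD_TIFF",
    "root": "/facility/data",
    "resource_path": "2020/05/26/scan_0001",
    "resource_kwargs": {},
}
root_map = {"/facility/data": "/home/user/exported"}

remapped = mapped_resource_path(resource, root_map)
unmapped = mapped_resource_path(resource, {})
```

The handler only ever sees the joined string, which is why it cannot reconstruct paths on its own if other path-like values are hidden in the kwargs.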

In some cases (such as the tiff writer) we do not know at the time we create the resource how many files will be created (because there is one frame per file). In the tiff case the 'resource_path' that comes in is actually a template string that we then push the datum kwargs through. I think this case alone prevents the file list from being handled at the event model level (as we would need to understand that some strings are templates and some are not) and rules out moving this logic up. The reason handlers exist as a concept is to absorb this sort of complexity and shield the next layer up (the client of the Filler / AssetStore / Filestore) from it.
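A minimal sketch of the template case, to show why only the handler knows how to turn a 'resource_path' into file names. The template syntax, the `point_number` key, and `tiff_file_list` are assumptions made for this example; the real tiff handlers may use a different template format and different datum kwargs.

```python
def tiff_file_list(template, datum_kwargs_gen):
    """Expand a path template once per datum to get one file per frame.

    ``template`` is assumed to be a ``str.format`` style template and each
    item of ``datum_kwargs_gen`` a dict providing the fields it needs.
    """
    return [template.format(**datum_kwargs) for datum_kwargs in datum_kwargs_gen]


# Hypothetical template and datum kwargs, for illustration only.
template = "/data/scan_{point_number:06d}.tiff"
datum_kwargs = [{"point_number": 0}, {"point_number": 1}, {"point_number": 2}]
files = tiff_file_list(template, datum_kwargs)
```

Code above the handler layer has no way to tell that this particular string is a template rather than a literal path, which is the point being made here.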

In the case of tiff we can dead-reckon what the files should be from just the data in the documents; however, there are detectors that (for whatever reason) write out data into chunked files. You may not know until you start reading the files how many of the files exist, which means that in general we cannot assume there is enough information in the documents + the handler code to sort out what files we are going to touch. I have a suspicion (but no proof) that at least some of the hdf5 handlers are getting their file lists wrong due to the use of external links in hdf5 files.
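For contrast, a minimal sketch of the chunked case, where the list can only be found by looking at the file system. The `<resource_path>_partNNN.dat` naming is invented for this example and does not describe any particular detector.

```python
import glob
import os
import tempfile


def chunked_file_list(resource_path):
    """List the chunk files that actually exist for one resource.

    Assumes a hypothetical layout in which a detector writes
    '<resource_path>_part000.dat', '<resource_path>_part001.dat', ...
    and the number of chunks is not recorded in the documents.
    """
    pattern = glob.escape(resource_path) + "_part*.dat"
    return sorted(glob.glob(pattern))


def demo():
    """Write three empty chunk files and return the base names that are found."""
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "scan_0001")
        for i in range(3):
            open(f"{base}_part{i:03d}.dat", "w").close()
        return [os.path.basename(p) for p in chunked_file_list(base)]
```

Nothing in the resource or datum documents says "three chunks" here; the answer comes from the directory listing, so this logic needs file access.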

I am also not convinced that a class method that gets to see all of the resource and datum documents is simpler than initing the handler and calling get_file_list. That said, I could see wanting to make different assumptions about how aggressively to open files / pre-load data if you know it is for actual data access vs. file listing, so something like

@classmethod
def file_list(cls, mapped_resource_path, resource_kwargs, datum_kwargs_gen):
    # Instantiate the handler, then delegate to the existing instance method.
    h = cls(mapped_resource_path, **resource_kwargs)
    return list(h.get_file_list(datum_kwargs_gen))

Even if it were technically possible, I don't think that it is a good idea to push this logic "up". The "external data" machinery is there to encode that there is data not literally in the documents, along with instructions on how to get it, while being agnostic to how that information is encoded at rest. For example, the current scheme holds together if we were to move to an object store or database (or pure function) to store the external data. Once you push the concept of "files" into event model, you are committing to a posix model for storing the external data (one of the mistakes we made with resource / datum is that our names are too file-centric), which is the opposite of what we want to do.


One way to think of both the handler registry and root_map is that, from the point of view of the event model, these are foreign keys that are (implicitly) joined with the document stream at run time. The root map lets you say "the filesystem can be remounted anywhere, and we provide a way to patch that up at read time". The spec + resource scheme lets you say "here is a function to call to get the data you want". In principle this could have been done in one layer, but in general there are some resources you want to acquire once and then re-use. For example, if you want to pull 1k planes out of an hdf5 file, you want to open the file exactly once and then make calls to the open hdf5 file rather than open and close the file 1k times (@dmgav has noted this problem at SRX, where each row is stored in its own file and a majority of the read time is spent opening hdf5 files). So the ((spec_map + root_map + resource_doc) -> handler and then (handler + datum_doc) -> data) scheme lets the data generation process provide hints to the data access layer about what the right scope of shared resources is.
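The two-stage scheme can be sketched in a few lines. This is a toy, not the real Filler: `MiniFiller` and `CountingHandler` are hypothetical, and the documents are assumed to carry only the keys used below (`uid`, `spec`, `root`, `resource_path`, `resource_kwargs` on the resource; `datum_kwargs` on the datum).

```python
import posixpath


class CountingHandler:
    """Hypothetical handler that counts how often it is constructed.

    Construction stands in for an expensive step such as opening a file.
    """

    opened = 0

    def __init__(self, resource_path, **resource_kwargs):
        type(self).opened += 1
        self._path = resource_path

    def __call__(self, **datum_kwargs):
        # Stand-in for reading one plane out of the already-open file.
        return (self._path, datum_kwargs["index"])


class MiniFiller:
    """Toy version of the two-stage lookup described above."""

    def __init__(self, spec_map, root_map=None):
        self._spec_map = spec_map
        self._root_map = root_map or {}
        self._handlers = {}  # one handler instance per resource uid

    def get_handler(self, resource):
        # Stage 1: (spec_map + root_map + resource_doc) -> handler, cached.
        uid = resource["uid"]
        if uid not in self._handlers:
            root = self._root_map.get(resource["root"], resource["root"])
            path = posixpath.join(root, resource["resource_path"])
            handler_class = self._spec_map[resource["spec"]]
            self._handlers[uid] = handler_class(path, **resource["resource_kwargs"])
        return self._handlers[uid]

    def fill(self, resource, datum):
        # Stage 2: (handler + datum_doc) -> data.
        return self.get_handler(resource)(**datum["datum_kwargs"])
```

With this split, reading many datums that share one resource constructs the handler once, so the resource document effectively marks the scope over which the open file is shared.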


There is some level of inherent complexity in every system that you cannot reduce (you can certainly add complexity ;)). Managing the zoo of file formats / layouts / filesystems is a particularly nasty bit of bookkeeping that we are managing in the handlers.
