get_file_list #15

gwbischof · 2020-05-26T20:21:57Z

A talk with @danielballan led us to these questions.

Can the handlers' get_file_list method be part of the handler class or be moved to event_model?

It appears that none of the handlers in handlers.py need to open the file to get the list of files for the run. And the file list can be determined from the mongo documents.

If get_file_list is very specific to the handler, maybe we leave this method as part of the Handler class, but change it so that it can be called, without calling the __init__ first.

Some of the handler __init__ methods call open on the file, and I would like to get the list of files without access to the file.

Thoughts?

danielballan · 2020-05-26T20:28:55Z

> the file list can be determined from the mongo documents.

This may be true now but I think it's important that we leave open the possibility that we may encounter formats where it won't be true. We emit Resource documents early in the acquisition process, and it's possible to imagine situations where we might not know which / how many files a certain detector will create, and where we would need access to the files themselves to work that out.

> If get_file_list is very specific to the handler, maybe we leave this method as part of the Handler class, but change it so that it can be called, without calling the __init__ first.

Right. To iterate on that idea a bit, we could add an optional classmethod to the API which returns the file list. In situations where that's not possible without opening the file, we would not implement the method.
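A minimal sketch of how an optional classmethod like that could be probed by a caller. The names `file_list_from_documents`, `TemplatedHandler`, `OpaqueHandler`, and `try_file_list` are hypothetical and are not part of any existing handler API:

```python
class TemplatedHandler:
    """Hypothetical handler whose files are fully determined by the documents."""

    def __init__(self, resource_path, **resource_kwargs):
        self._path = resource_path

    @classmethod
    def file_list_from_documents(cls, resource_path, resource_kwargs, datum_kwargs_gen):
        # Hypothetical optional hook: nothing is opened here.
        return [resource_path]


class OpaqueHandler:
    """Hypothetical handler that would need to open its file to list its files."""

    def __init__(self, resource_path, **resource_kwargs):
        self._path = resource_path

    # Intentionally does not implement file_list_from_documents.


def try_file_list(handler_class, resource_path, resource_kwargs, datum_kwargs_gen):
    """Return the file list if the handler offers the optional hook, else None."""
    hook = getattr(handler_class, "file_list_from_documents", None)
    if hook is None:
        return None
    return hook(resource_path, resource_kwargs, datum_kwargs_gen)
```

A caller that gets `None` back would have to fall back to instantiating the handler and calling get_file_list on it.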

stuartcampbell · 2020-05-26T20:35:21Z

Do we want to think about what might come in the future? Can we foresee a time when the 'resource' list might not be a list of files, but a list of other URIs (or something else)?

danielballan · 2020-05-26T20:40:09Z

A point related to @stuartcampbell's: we want to keep an eye on making handlers simpler, perhaps reducing them to something more like a Reader (https://github.com/danielballan/pims2-prototype) that supports the built-in open API, rather than making them ever more databroker-specific. Related to bluesky/event-model#156

tacaswell · 2020-05-26T22:04:05Z

No, the exact list of files is something that can only be determined by the handlers, because we need to support both root_map and file formats that may produce an arbitrary number of files.

We need to be able, at access time, to re-map the "root" from where it was at collection time. Even if we had a unified file system across the facility, we would not be able to have that across the street or at users' home institutions. I think the tooling built in databroker-pack, which automates (re)writing root maps, validates this design.
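As a rough illustration of the re-mapping (a sketch, not the actual filler code), assume a Resource document with "root" and "resource_path" keys and a root_map dict from collection-time roots to access-time roots. The helper name `remap_resource_path` and all paths are made up:

```python
import posixpath


def remap_resource_path(resource, root_map):
    """Join the (possibly re-mapped) root with the resource_path."""
    # Fall back to the collection-time root if no mapping is given.
    root = root_map.get(resource["root"], resource["root"])
    return posixpath.join(root, resource["resource_path"])


# Hypothetical document and mapping, for illustration only.
resource = {"root": "/facility/data", "resource_path": "2020/05/scan.h5"}
at_home = remap_resource_path(resource, {"/facility/data": "/home/user/export"})
at_facility = remap_resource_path(resource, {})
```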

To simplify the handlers, we chose to do the root re-mapping in the filestore/databroker/filler code and then pass the handler a string that is an absolute path. However, because the root map is not passed into the handler, the only places where we can have something that looks like a file path are the path-related entries in the Resource document and the resource_path.

In some cases (such as the tiff writer) we do not know at the time we create the resource how many files will be created (because there is one frame per file). In the tiff case the 'resource_path' that comes in is actually a template string that we then push the datum kwargs through. I think this case alone prevents the file list from being handled at the event model level, because moving this logic up would require the event model to understand that some strings are templates and some are not. The reason handlers exist as a concept is to absorb this sort of complexity and shield the next layer up (the client of the Filler / AssetStore / Filestore) from it.
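A toy version of the template case, to make the point concrete. This is not the real tiff handler; the class name `TemplateTiffHandler`, the template syntax, and the `point_number` datum kwarg are assumptions chosen for illustration:

```python
class TemplateTiffHandler:
    """Hypothetical one-frame-per-file handler driven by a path template."""

    def __init__(self, resource_path_template):
        # The incoming "path" is really a template; only the handler knows that.
        self._template = resource_path_template

    def get_file_list(self, datum_kwargs_gen):
        # One file per datum: push each datum's kwargs through the template.
        return [self._template.format(**kw) for kw in datum_kwargs_gen]


handler = TemplateTiffHandler("/data/scan42/frame_{point_number:05d}.tiff")
files = handler.get_file_list({"point_number": i} for i in range(3))
```

Nothing above the handler can tell that this particular string needs formatting, which is the argument for keeping the logic in the handler.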

In the case of tiff we can dead-reckon what the files should be from just the data in the documents. However, there are detectors that (for whatever reason) write out data into chunked files. You may not know until you start reading the files how many of them exist, which means that in general we cannot assume there is enough information in the documents + the handler code to sort out what files we are going to touch. I have a suspicion (but no proof) that at least some of the hdf5 handlers are getting their file lists wrong due to the use of external links in hdf5 files.
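A sketch of the chunked case, assuming a made-up detector that writes `<prefix>_NNNN.dat` files and a hypothetical `ChunkedHandler`. The only way to learn the chunk count here is to look at storage:

```python
import glob
import os
import tempfile


class ChunkedHandler:
    """Hypothetical handler for a detector that writes <prefix>_NNNN.dat chunks."""

    def __init__(self, resource_path):
        self._prefix = resource_path

    def get_file_list(self, datum_kwargs_gen):
        # The documents do not record how many chunks exist, so ask the filesystem.
        return sorted(glob.glob(glob.escape(self._prefix) + "_*.dat"))


# Simulate a detector that happened to write three chunks.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        with open(os.path.join(tmp, f"scan_{i:04d}.dat"), "w"):
            pass
    handler = ChunkedHandler(os.path.join(tmp, "scan"))
    names = [os.path.basename(p) for p in handler.get_file_list([])]
```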

I am also not convinced that a class method that gets to see all of the resource and datum documents is simpler than instantiating the handler and calling get_file_list. That said, I could see wanting to make different assumptions about how aggressively to open files / pre-load data if you know whether it is for actual data access or for file listing, so something like

@classmethod
def file_list(cls, mapped_resource_path, resource_kwargs, datum_kwargs_gen):
    # Build a throwaway handler instance just to enumerate its files.
    h = cls(mapped_resource_path, **resource_kwargs)
    return list(h.get_file_list(datum_kwargs_gen))

Even if it were technically possible, I don't think that it is a good idea to push this logic "up". The "external data" machinery is there to encode that there is data not literally in the documents, along with instructions on how to get it, while being agnostic to how that information is encoded at rest. For example, the current scheme holds together if we were to move to an object store or database (or pure function) to store the external data. Once you push the concept of "files" into event model, you are committing to a POSIX model for storing the external data (one of the mistakes we made with resource / datum is that our names are too file-centric), which is the opposite of what we want to do.

One way to think of both the handler registry and root_map is that, from the point of view of the event model, these are foreign keys that are (implicitly) joined with the document stream at run time. The root map lets you say "the filesystem can be remounted anywhere, and we provide a way to patch that up at read time". The spec + resource scheme lets you say "here is a function to call to get the data you want". In principle this could have been done in one layer, but in general there are some resources you want to acquire once and then re-use. For example, if you want to pull 1k planes out of an hdf5 file, you want to open the file exactly once and then make calls to the open hdf5 file rather than open and close the file 1k times (@dmgav has noted this problem at SRX, where each row is stored in its own file and a majority of the read time is spent opening hdf5 files). So the ((spec_map + root_map + resource_doc) -> handler and then (handler + datum_doc) -> data) scheme lets the data generation process provide hints to the data access layer about what the right scope of shared resources is.
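A stripped-down sketch of that two-stage scheme, with a counter standing in for the expensive open. The names `CountingHandler` and `read_all`, the "plane" datum kwarg, and the document layouts are simplified assumptions, not the real filler implementation:

```python
import posixpath


class CountingHandler:
    """Hypothetical handler that records how often the resource is 'opened'."""

    open_count = 0

    def __init__(self, resource_path, **resource_kwargs):
        type(self).open_count += 1  # stand-in for opening an hdf5 file
        self._path = resource_path

    def __call__(self, **datum_kwargs):
        # Stand-in for slicing one plane out of the already-open file.
        return (self._path, datum_kwargs["plane"])


def read_all(spec_map, root_map, resource, datums):
    # Stage 1: (spec_map + root_map + resource_doc) -> handler, done once.
    handler_class = spec_map[resource["spec"]]
    root = root_map.get(resource["root"], resource["root"])
    path = posixpath.join(root, resource["resource_path"])
    handler = handler_class(path, **resource["resource_kwargs"])
    # Stage 2: (handler + datum_doc) -> data, done once per datum.
    return [handler(**d["datum_kwargs"]) for d in datums]
```

With 1k datum documents pointing at one resource, the handler is built once and called 1k times, which is the sharing scope the resource document is hinting at.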

There is some level of inherent complexity in every system that you cannot reduce (you can certainly add complexity ;)). Managing the zoo of file formats / layouts / filesystems is a particularly nasty bit of bookkeeping that we are managing in the handlers.
