Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add computable report when graceful-read-failure is active in from_map #415

Closed
wants to merge 16 commits into from

Conversation

douglasdavis
Copy link
Collaborator

@douglasdavis douglasdavis commented Nov 14, 2023

This changes the from_map interface to return two collections when empty_on_raise and empty_backend are both not None. (When those arguments are both None we just return a single Array, which has been the existing behavior until this PR, hopefully the @overloads show this in a clear way)

  • The first returned Array collection is what we've always had: the lazy awkward array populated by data on disk.
  • The second returned Array is a record array, one element per partition, that represents a report about failed reads.

After banging my head thinking of ways to handle reports inside the graph I realized it's possible for us to just make it an awkward array (instead of some new object, the rabbit hole I originally went down). I still need to check if this breaks any optimizations, but my intuition is that it shouldn't. I still wouldn't bet the farm :)

cc @lgray @agoose77 for your thoughts. This PR demonstrates the possibility, and I think it fits in nicely with the existing uproot.iterate(..., report=True) two-item return pattern.

@douglasdavis douglasdavis marked this pull request as draft November 14, 2023 17:59
@codecov-commenter
Copy link

codecov-commenter commented Nov 14, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (dd3267a) 93.21% compared to head (06fb97e) 93.21%.

Files Patch % Lines
src/dask_awkward/layers/layers.py 90.90% 2 Missing ⚠️
src/dask_awkward/lib/io/io.py 97.22% 1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #415   +/-   ##
=======================================
  Coverage   93.21%   93.21%           
=======================================
  Files          23       23           
  Lines        3168     3214   +46     
=======================================
+ Hits         2953     2996   +43     
- Misses        215      218    +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +691 to +794
res = result.map_partitions(first, meta=array_meta, output_divisions=1)
rep = result.map_partitions(second, meta=empty_typetracer())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here res is the actual interesting Array collection. The array_meta object is the meta we'd normally expect without going through the process of dual returns with the report. We can manually pass that meta object to map_partitions to override its default behavior of automatically determining a new meta, we also make sure to preserve divisions if they're known.

And rep is the new report Array collection, it doesn't matter what the meta or divisions are, this should be a pretty small object once computed. If/when we converge on a proper schema for a report array perhaps we can pass in its typetracer version for absolute correctness here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not specify meta and let dask-awkward figure it out using the graph?

@douglasdavis douglasdavis force-pushed the failure-report branch 3 times, most recently from 82ace9f to 31e570d Compare November 16, 2023 22:12
@agoose77 agoose77 self-requested a review November 26, 2023 20:46
@douglasdavis douglasdavis marked this pull request as ready for review November 27, 2023 17:12
@douglasdavis
Copy link
Collaborator Author

douglasdavis commented Nov 27, 2023

I just gave the schema a little better definition, here it is in JSONSchema:

{
    "type": "object",
    "properties": {
        "args": {"type": "array", "items": {"type": "string"}},
        "kwargs": {
            "type": "array",
            "items": {"type": "array", "items": {"type": "string"}},
        },
        "exception": {"type": "string"},
        "message": {"type": "string"},
    },
}

We report back the function arguments, keyword arguments, the exception name, and the message associated with it.

  • args is a list of strings
  • kwargs is a list of list of strings, the inner list is always length two: [key, value]
  • exception is a list of strings
  • message is a list of strings

The report is always the same length as there are number of partitions. So if we have five partitions and two failures, the computed report will end up looking something like (making up some arguments and a single keyword argument):

[
    {'args': [], 'kwargs': [], 'exception': '', 'message': ''},   # <-- succeeded
    {'args': [], 'kwargs': [], 'exception': '', 'message': ''},   # <-- succeeded
    {'args': ["/path/to/file2", "True", "2"], 'kwargs': [["kwarg1", "foo"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
    {'args': [], 'kwargs': [], 'exception': '', 'message': ''},   # <-- succeeded
    {'args': ["/path/to/file4", "True", "4"], 'kwargs': [["kwarg1", "bar"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
]

@lgray
Copy link
Collaborator

lgray commented Nov 28, 2023

Oh - this is very good. I'll try to test it asap. If, instead of the partition number, it would just report the start and stop rows it would be more immediately useful to users I think. Otherwise there's some digging to do.

@lgray
Copy link
Collaborator

lgray commented Nov 28, 2023

Will this scale OK when there are 200k partitions?

@douglasdavis
Copy link
Collaborator Author

Will this scale OK when there are 200k partitions?

I imagine it would. I think a 200k element array isn't too crazy, especially when most elements will be records without much data.

An alternative approach to shrink the report would be to actually have the elements of the "report array" that represent successful reads to just be None instead of the thin record. Something like:

[
    None,
    None,
    {'args': ["/path/to/file2", "True", "2"], 'kwargs': [["kwarg1", "foo"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
    None,
    {'args': ["/path/to/file4", "True", "4"], 'kwargs': [["kwarg1", "bar"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
]

instead of what I wrote in my above comment.

If, instead of the partition number, it would just report the start and stop rows it would be more immediately useful to users I think

Seems like we may want to add the ability to have consumers of this API be able to pass in a callback function that accepts the *args and **kwargs (with domain knowledge) and construct something to add to the report. As it stands dask-awkward can only blindly accept the args and kwargs.

@douglasdavis
Copy link
Collaborator Author

Ah I may have misunderstood, if the divisions are known it would be possible for dask-awkward to combine the known divisions with the failed partitions.

@lgray
Copy link
Collaborator

lgray commented Nov 28, 2023

Yeah I like the idea of being able to control the form of what gets returned, with some decent presets.

You understood my question the first time correctly! If there are (somehow) 200k file-read errors are we going to cause further errors or hassle for the user. :-D Seems like not so much, but I wanted to double check with you.

@douglasdavis
Copy link
Collaborator Author

Great, I'm working on making what gets reported customizable, with what I think is a sane default.

@lgray
Copy link
Collaborator

lgray commented Nov 28, 2023

Oh I see what you mean.

Yeah I mean for the return:

[
    None,
    None,
    {'args': ["/path/to/file2", "True", "2"], 'kwargs': [["kwarg1", "foo"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
    None,
    {'args': ["/path/to/file4", "True", "4"], 'kwargs': [["kwarg1", "bar"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
]

something like:

[
    None,
    None,
    {'args': ["/path/to/file2", "True", "200-300"], 'kwargs': [["kwarg1", "foo"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
    None,
    {'args': ["/path/to/file4", "True", "400-500"], 'kwargs': [["kwarg1", "bar"]], 'exception': 'RuntimeError', 'message': 'Failed to read.'},
]

Would be more immediately useful.
If that last value is the partition number.

@douglasdavis
Copy link
Collaborator Author

Ah in my little example that was just some made up set of arguments that were getting passed to the actual read function. So in the uproot case it would be the arguments passed to the callable class _UprootOpenAndRead, for example.

I've just pushed a commit that makes the report construction customizable but with the default that we've already been discussing.

To customize, the consumer of this API needs to define two functions: a success function and a failure function. The success function just takes f(*args, **kwargs), while the failure function takes f(exception, *args, **kwargs). Both must return an awkward array. The default versions for each are pretty easy to interpret.

from_map now has four new arguments -- I'm not loving it but I'm also having a hard time picturing an alternative

@douglasdavis
Copy link
Collaborator Author

douglasdavis commented Nov 28, 2023

For reporting the rows, I think one would just need to combine the knowledge of

def find_bad_rows(report, divisions):
    bad_rows = []
    for i, element in enumerate(report):
        # if element in the report arrays shows "bad" (e.g. the "exception" field of the default cause is a non-empty string)
        if element.exception:
            bad_rows.append((divisions[i], divisions[i+1]))
    return bad_rows

report_collection, array_collection = uproot.dask(..., report=True)
report, array = dask.compute(report_collection, array_collection)
bad_rows = find_bad_rows(report, array_collection.divisions)

Or, if the *args and **kwargs have any knowledge of eventual rows, that can be embedded in the consumer defined success/failure functions.

@lgray
Copy link
Collaborator

lgray commented Nov 28, 2023

Just kind of a cosmetic thing but can we have the return be:

array, report = uproot.dask(...)

that would match typetracer with report, etc. as far as consistency goes.

@douglasdavis
Copy link
Collaborator Author

Of course! this was just some shorthand -- I'm sure the PR to uproot will be be more thought out :)

@douglasdavis
Copy link
Collaborator Author

douglasdavis commented Nov 29, 2023

As this PR stands our column optimization breaks because the io_func is getting wrapped with our return_empty_on_raise function. The optimization check to see if something is projectable fails because of the wrapping. I'm trying to solve this now.

Solved

@douglasdavis
Copy link
Collaborator Author

@lgray this should be ready to get put to the fire by an implementation in uproot. I'll start drawing something up

@lgray
Copy link
Collaborator

lgray commented Nov 29, 2023

awesome! A parquet-based one would be cool too but is not as high-priority.

def io_func_implements_projection(func: ImplementsIOFunction) -> bool:
return hasattr(func, "prepare_for_projection")
return hasattr(maybe_unwrap(func), "prepare_for_projection")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be tempted to remove the unwrapping here, and require that the caller unwrap the IO function. That way it's a bit more explicit?

@agoose77
Copy link
Collaborator

agoose77 commented Nov 29, 2023

@douglasdavis looks good! I made some comments.

In general, do you think it's possible / sensible to move the raise_on_error machinery to additional protocols that io_func implements? I.e. rather than wrapping the IO function, we explicitly ask it if it supports these features? My hunch is that will be slightly simpler, and we could still provide a convenience adaptor that does this for plain-callable IO functions.

Hopefully this review doesn't feel like a lot of nitpicks and lastminute.com suggestions. I'm glad you've thought about this, because it's important to get right, and I'm just adding some without-much-context thoughts.

@douglasdavis
Copy link
Collaborator Author

@agoose77 thanks a lot for the comments.

In general, do you think it's possible / sensible to move the raise_on_error machinery to additional protocols that io_func implements? I.e. rather than wrapping the IO function, we explicitly ask it if it supports these features? My hunch is that will be slightly simpler, and we could still provide a convenience adaptor that does this for plain-callable IO functions.

I have thought a bit about this, especially with the from_map arguments starting to blow up a bit, and with the specific changes which had to be made to handle the new complexities of wrapped io_funcs. I'll keep ruminating on it a bit.

Hopefully this review doesn't feel like a lot of nitpicks and lastminute.com suggestions. I'm glad you've thought about this, because it's important to get right, and I'm just adding some without-much-context thoughts.

Not at all!

@douglasdavis
Copy link
Collaborator Author

douglasdavis commented Nov 29, 2023

The more I think about it, the more I like smooshing most of the logic into having to live in the io_func classes. I'll see if I can refactor this into doing that and keeping the from_map function signature more simple.

@douglasdavis douglasdavis force-pushed the failure-report branch 3 times, most recently from 54b4d74 to 6720aa7 Compare November 30, 2023 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants