Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support returning empty arrays in from_map IO function calls based on user declared exceptions #400

Merged

Conversation

douglasdavis
Copy link
Collaborator

@douglasdavis douglasdavis commented Oct 31, 2023

Discussed on Slack, this is a way to give from_map callers the ability to provide a list of "allowed" exceptions that should just return an empty array. It requires the io_func to have a form attribute such that a correctly formed empty array can be instantiated. cc @lgray

To use this new feature there are two requirements:

  1. Write a mock_empty method (that takes a single argument that should be passed to ak.to_backend) in the class function passed to from_map. (The ColumnarProjectionMixin class in the lib.io.columnar module provides an example implementation).
  2. pass values to the empty_on_raise and empty_backend arguments of from_map. tests/test_io.py has example from_map calls that includes values for these arguments

@douglasdavis
Copy link
Collaborator Author

This currently lacks the ability to keep track of what arguments fail.

result = fn(*args, **kwargs)
return result
except allowed_exceptions:
return ak.Array(fn.form.length_zero_array(highlevel=False))
Copy link
Collaborator

@agoose77 agoose77 Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also have the existing .mock interface: https://github.com/douglasdavis/dask-awkward/blob/a368106dd57c4a355022db42ab9f97f853e2f566/src/dask_awkward/layers/layers.py#L54-L56

I wonder if either we should extend this with another method mock_empty(), or whether we should require only mock and take the form from the mocked array: fn.mock().layout.form.length_zero_array()

One thing to be mindful of is that we should allow the implementer to choose the backend. Perhaps mock should take a backend argument instead...

Copy link
Collaborator Author

@douglasdavis douglasdavis Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes I like the idea of reusing mock for this task, thanks for peeking at this

Copy link
Collaborator Author

@douglasdavis douglasdavis Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since mock has been around for a while and has been associated with typetracer-only parts of the code, perhaps it is best to go with a new mock_empty(backend: str) method

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerning the backend configurability: we should probably forbid the implementer to inject the typetracer backend as for this purpose it just doesn't make any sense?

@codecov-commenter
Copy link

codecov-commenter commented Oct 31, 2023

Codecov Report

Merging #400 (6e46fdf) into main (bde300c) will decrease coverage by 0.09%.
Report is 1 commits behind head on main.
The diff coverage is 91.07%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

@@            Coverage Diff             @@
##             main     #400      +/-   ##
==========================================
- Coverage   94.03%   93.94%   -0.09%     
==========================================
  Files          23       23              
  Lines        3066     3107      +41     
==========================================
+ Hits         2883     2919      +36     
- Misses        183      188       +5     
Files Coverage Δ
src/dask_awkward/lib/io/io.py 96.64% <100.00%> (+0.64%) ⬆️
src/dask_awkward/lib/testutils.py 100.00% <100.00%> (ø)
src/dask_awkward/lib/io/columnar.py 98.38% <66.66%> (-1.62%) ⬇️
src/dask_awkward/layers/layers.py 93.07% <60.00%> (-2.83%) ⬇️

... and 1 file with indirect coverage changes

📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today!

@douglasdavis
Copy link
Collaborator Author

Alright I've reworked the PR to use a new mock_empty method which is designed to be restricted to one of ("cpu", "jax", "cuda")

@douglasdavis
Copy link
Collaborator Author

We also need a way to communicate the backend to return_empty_on_raise

def mock_empty(self, backend: BackendT = "cpu") -> AwkwardArray:
import awkward as ak

if backend not in ("cpu", "jax", "cuda"):
Copy link
Collaborator

@lgray lgray Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, despite this being a bit of an abuse of a private interface you could make a set out of the keys of https://github.com/scikit-hep/awkward/blob/main/src/awkward/_backends/dispatch.py#L24
and subtract "typetracer" from the set and use that here. Should remove and problems coming from more backends over time.

@agoose77 may have issue with this though given it being hacky.

@douglasdavis douglasdavis force-pushed the allow-io-functions-to-raise branch 2 times, most recently from 64ad900 to 583fc94 Compare November 2, 2023 20:41
@douglasdavis douglasdavis marked this pull request as ready for review November 2, 2023 21:57
Comment on lines 465 to 472
@functools.wraps(fn)
def wrapped(*args, **kwargs):
try:
return fn(*args, **kwargs)
except allowed_exceptions:
return fn.mock_empty(backend)

return wrapped
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed adding a report mechanism here, so that the partitions can return a Report object (that perhaps the IO func can populate).

This would mean at a high level that we don't build an dak.Array from the io_func applications; rather, we need an unpacking step so that we build two subgraphs that depend upon the io_func result.

@douglasdavis douglasdavis changed the title feat: return empty array if from_map IO function raises certain exceptions feat: support returning empty arrays in from_map IO function calls based on user declared exceptions Nov 10, 2023
@douglasdavis douglasdavis merged commit f447df4 into dask-contrib:main Nov 10, 2023
22 of 23 checks passed
@douglasdavis
Copy link
Collaborator Author

douglasdavis commented Nov 10, 2023

@lgray with this PR merged we get graceful skips without well defined reporting. Two steps for enabling the feature are written out in the PR description.

I did add a log message such that

import logging
logging.basicConfig()
logging.getLogger("dask_awkward.lib.io.io").setLevel(logging.INFO)

will give a log message with some details about the calls that return an empty array. I consider this temporary and we can followup this PR with a better developed reporting mechanism

@douglasdavis douglasdavis deleted the allow-io-functions-to-raise branch November 10, 2023 17:01
@lgray
Copy link
Collaborator

lgray commented Nov 10, 2023

Great! This gives enough to prototype something that's a good analogue of what we have presently in coffea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants