Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support placeholder arrays in pickling #366

Merged
merged 5 commits into from
Sep 18, 2023

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Sep 17, 2023

@douglasdavis this PR refactors some code to aid understanding of the logic, and ensures that we don't try to pickle non-buffers using pickle protocol 5.

@codecov-commenter
Copy link

codecov-commenter commented Sep 17, 2023

Codecov Report

Merging #366 (c7165e7) into main (b60a25e) will decrease coverage by 0.05%.
The diff coverage is 93.18%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

@@            Coverage Diff             @@
##             main     #366      +/-   ##
==========================================
- Coverage   93.87%   93.82%   -0.05%     
==========================================
  Files          23       23              
  Lines        2907     2917      +10     
==========================================
+ Hits         2729     2737       +8     
- Misses        178      180       +2     
Files Changed Coverage Δ
src/dask_awkward/pickle.py 87.87% <87.50%> (-1.41%) ⬇️
src/dask_awkward/lib/optimize.py 95.79% <93.93%> (-0.38%) ⬇️
src/dask_awkward/lib/_utils.py 89.47% <100.00%> (+1.23%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Comment on lines +440 to +445
if column.endswith(LIST_KEY):
list_parent_path = column[: -(len(LIST_KEY) + 1)].rstrip(".")
if list_parent_path not in necessary_columns:
necessary_columns.append(f"{list_parent_path}.*")
else:
necessary_columns.append(column)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@douglasdavis this is where I was initially thinking that we need to do something for shape_touched. However, it might prove fiddly.

Naively, I thought about A.* if we want the shape of A.B. However, this wouldn't work for the following type of A

1 * {
    x: var * int64,
    y: var * int64
}

Computing A.y wouldn't give any information about the length of the NumpyArray in A.x.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to follow up (I don't expect you to have seen this over the weekend) — this is the blocker for @lgray's demo this week AFAICT.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I'm understanding: are you thinking that there's still a another case we need to add to this part of the code? As things stand right now I'm able to follow the changes and they looks good to me (but I don't know with certainty that all of our cases are covered)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR fixes the pickling problem, but there's still the task of loading lengths for buffers with shape_touched. I'll move this to Slack :)

@agoose77
Copy link
Collaborator Author

@agoose77 agoose77 marked this pull request as ready for review September 18, 2023 11:56
@agoose77 agoose77 changed the title fix: pickle placeholder fix: support placeholder arrays in pickling Sep 18, 2023
Copy link
Collaborator

@douglasdavis douglasdavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things look good to me and tests are passing, just had that one question on the inline comment

@ikrommyd
Copy link
Contributor

It doesn't do it for me.

The reproducer in #363:

import dask
import uproot
import dask_awkward as dak
events = uproot.dask({"/path/to/coffea/tests/samples/nano_dy.root": "Events"})
p1 = events.Electron_pt[:20]
p2 = events.Electron_pt[20:]
dask.compute(p1, p2, scheduler="processes")

fails with TypeError: internal_error: the _UnknownLength class should never be directly instantiated, and my pytest looks like this:

FAILED tests/test_tnp.py::test_local_compute[False-processes] - TypeError: internal_error: the _UnknownLength class should never be directly instantiated
FAILED tests/test_tnp.py::test_local_compute[True-processes] - TypeError: internal_error: the _UnknownLength class should never be directly instantiated
FAILED tests/test_tnp.py::test_distributed_compute[False] - TypeError: cannot order unknown lengths
FAILED tests/test_tnp.py::test_distributed_compute[True] - TypeError: cannot order unknown lengths
import dask
import uproot
import dask_awkward as dak
from distributed import Client
events = uproot.dask({"/path/to/coffea/tests/samples/nano_dy.root": "Events"})
p1 = events.Electron_pt[:20]
p2 = events.Electron_pt[20:]
with Client():
    dask.compute(p1, p2)

This snippet wil give you this: TypeError: cannot order unknown lengths (same as the pytest)

@agoose77
Copy link
Collaborator Author

@iasonkrom you need the latest Awkward for this (HEAD)

@ikrommyd
Copy link
Contributor

Yeah my bad, however,

first snippet now gives:

TypeError: PlaceholderArray supports only trivial slices, not int

This error occurred while calling

    ak.from_buffers(
        {'class': 'RecordArray', 'fields': ['run', 'luminosityBlock', 'event'...
        40
        {'node1-data': <awkward._nplikes.placeholder.PlaceholderArray object ...
        highlevel = False
        buffer_key = '{form_key}-{attribute}'
        byteorder = '<'
    )

and the second snippet still the same error:

Traceback (most recent call last):
  File "/Users/iason/fun/egamma_dev/egamma-tnp/test2.py", line 10, in <module>
    dask.compute(p1, p2)
  File "/Users/iason/fun/egamma_dev/venv/lib/python3.10/site-packages/dask/base.py", line 628, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/Users/iason/fun/egamma_dev/venv/lib/python3.10/site-packages/distributed/spill.py", line 181, in __setitem__
    super().__setitem__(key, value)
  File "/Users/iason/fun/egamma_dev/venv/lib/python3.10/site-packages/zict/buffer.py", line 193, in __setitem__
    self.fast[key] = value
  File "/Users/iason/fun/egamma_dev/venv/lib/python3.10/site-packages/zict/lru.py", line 136, in __setitem__
    self.set_noevict(key, value)
  File "/Users/iason/fun/egamma_dev/venv/lib/python3.10/site-packages/zict/common.py", line 127, in wrapper
    return func(*args, **kwargs)
  File "/Users/iason/fun/egamma_dev/venv/lib/python3.10/site-packages/zict/lru.py", line 163, in set_noevict
    if weight > self.n:
  File "/Users/iason/fun/egamma_dev/venv/lib/python3.10/site-packages/awkward/_nplikes/shape.py", line 67, in __gt__
    raise TypeError("cannot order unknown lengths")
TypeError: cannot order unknown lengths

@agoose77
Copy link
Collaborator Author

@iasonkrom could you use scikit-hep/awkward#2714 instead of HEAD for awkward? I might have misremembered with PR you should use :)

@ikrommyd
Copy link
Contributor

Yup, the branch works fine. Both snippets do not raise an error and pytests pass

@agoose77 agoose77 merged commit 3125bf4 into main Sep 18, 2023
20 checks passed
@douglasdavis
Copy link
Collaborator

I'll put out a release with this fix in 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants