Unclear how to manually do column projection with `uproot.dask` (and API differences with `dask-awkward`) #1361

pfackeldey (Maintainer) · 2024-12-12T23:27:53Z

I'm currently looking into adjusting the IO layer of the dask graph so that it reads only a given list of columns.

With uproot.dask this looks as follows:

import uproot

io = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x1670b73d0>
#  0. from-uproot-138b384738005b2a7a7eefbb600ca6c2

# update the IO layer in-place, i.e. do column projection
for layer in io.dask.layers.values():
    layer.io_func = layer.io_func.project_keys(frozenset(["nJet"]))

# now compute, should only load `nJet`
io.compute()
# ... TypeError: PlaceholderArray supports only trivial slices, not int

(I have the impression that the underlying form is not updated accordingly here. Or am I using the projection interface incorrectly?)

If I do this with parquet instead though, it works:

import dask_awkward as dak

io = dak.from_parquet("https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.parquet")
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x15c4ce1a0>
# 0. from-parquet-150809c2f6f63708200b7f130d3a395d

# update the IO layer in-place, i.e. do column projection
for layer in io.dask.layers.values():
    layer.io_func = layer.io_func.project_columns(frozenset(["nJet"]))

# now compute, should only load `nJet`
io.compute()
# <Array [{nJet: 5}, {nJet: 8}, ..., {...}, {nJet: 2}] type='40 * {nJet: uint32}'>

I don't understand why the above code example works for dak.from_parquet but not for uproot.dask. There seems to be a real difference in how the column projection is implemented for the io_func of the dask layer.

Apart from that, the APIs are very similar but also a bit misaligned between uproot and dask-awkward (probably for historical reasons), e.g.:

- .project_keys() vs .project_columns()
- form_with_unique_keys argument '<root>' vs '@'
- the state that holds the information of the trace is constructed differently: https://github.com/scikit-hep/uproot5/blob/main/src/uproot/_dask.py#L1082-L1084 vs https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py#L104

There are probably some more that I've not yet encountered.

In principle, it would be nice if uproot.dask adhered to the protocols defined here: https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py, to eliminate these differences. Some of this seems to be duplicated code in uproot._dask as well.
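
To make the idea of a shared protocol concrete, here is a minimal sketch using `typing.Protocol`. The names `Projectable`, `ToyParquetReader` and `ToyRootReader` are hypothetical and are not taken from uproot or dask-awkward; the actual protocols in columnar.py are not reproduced here.

```python
from typing import FrozenSet, Protocol, runtime_checkable


@runtime_checkable
class Projectable(Protocol):
    """Hypothetical structural type: anything with project_columns()."""

    def project_columns(self, columns: FrozenSet[str]) -> "Projectable": ...


class ToyParquetReader:
    """Stand-in for an IO function that already follows the protocol."""

    def __init__(self, columns: FrozenSet[str]) -> None:
        self.columns = columns

    def project_columns(self, columns: FrozenSet[str]) -> "ToyParquetReader":
        # Return a new reader restricted to the requested columns.
        return ToyParquetReader(self.columns & columns)


class ToyRootReader:
    """Stand-in for an IO function that uses a different method name."""

    def __init__(self, keys: FrozenSet[str]) -> None:
        self.keys = keys

    def project_keys(self, keys: FrozenSet[str]) -> "ToyRootReader":
        return ToyRootReader(self.keys & keys)


parquet_reader = ToyParquetReader(frozenset({"nJet", "run"}))
root_reader = ToyRootReader(frozenset({"nJet", "run"}))

# isinstance() on a runtime-checkable protocol only checks that the
# method exists, not its signature.
conforms_parquet = isinstance(parquet_reader, Projectable)
conforms_root = isinstance(root_reader, Projectable)
```

In this toy setup only the reader with `project_columns` passes the structural check, which is the kind of mismatch a common protocol would surface early.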

I'm currently trying to find a way to unify the APIs and to find the reason for this difference.
I'd appreciate any input on how this should work/behave and how we can ensure that the APIs won't diverge in the future.

(If this API were unified, it would be rather easy to make dak.project_columns work for all kinds of AwkwardInputLayer.)

pfackeldey (Maintainer, Author) · 2024-12-12T23:36:52Z

(pinging @lgray @agoose77 @kkothari2001 for ideas, input and help 🙏 )

martindurant · 2024-12-13T14:52:00Z

There are different conventions for how the columns are named, and uproot encodes extra things in these names (because some columns are always required even when not in the output).

The correct place to call project is probably the layer rather than the IO function: the layer provides a place to override the "awkward" column names with the "IO convention" names.
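
A rough sketch of that idea, using plain Python rather than the real layer classes: the layer owns the translation from user-facing field names to the source's own keys, and it adds any keys the source always needs. `ToyLayer`, `ToyIOFunc`, the `name_map` contents and the `always_required` set are all hypothetical and do not mirror uproot's actual naming scheme.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class ToyIOFunc:
    """Stand-in for an IO function; it only understands source-level keys."""

    keys: FrozenSet[str]

    def project_keys(self, keys: FrozenSet[str]) -> "ToyIOFunc":
        return replace(self, keys=self.keys & keys)


@dataclass(frozen=True)
class ToyLayer:
    """Stand-in for a graph layer that wraps an IO function."""

    io_func: ToyIOFunc
    # Hypothetical mapping: user-facing field name -> source-level key.
    name_map: Dict[str, str]
    # Hypothetical keys the source needs even if nobody asked for them.
    always_required: FrozenSet[str] = field(default_factory=frozenset)

    def project(self, columns: FrozenSet[str]) -> "ToyLayer":
        # Translate names first, then add the keys that must always be read.
        source_keys = frozenset(self.name_map[c] for c in columns)
        source_keys |= self.always_required
        return replace(self, io_func=self.io_func.project_keys(source_keys))


layer = ToyLayer(
    io_func=ToyIOFunc(
        frozenset({"branch:nJet", "branch:Jet_pt", "branch:counts"})
    ),
    name_map={"nJet": "branch:nJet", "Jet_pt": "branch:Jet_pt"},
    always_required=frozenset({"branch:counts"}),
)
projected = layer.project(frozenset({"nJet"}))
```

The caller only ever passes `"nJet"`; the extra `"branch:counts"` key is added by the layer, which is the kind of detail an IO function on its own cannot know about.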

(This is something that the one-pass PR explicitly worked around, removing a lot of protocol classes and code in the process)

pfackeldey (Maintainer, Author) · 2024-12-13T20:36:53Z

Thanks for the clarification.
In current dask-awkward, one would need to pass a report and a state to layer.project(), which is far from user-friendly, especially since the state differs between uproot.dask and dak.from_*.
So a really good thing about the one-pass optimization is that it can directly accept the column names inferred by dak.necessary_columns!

agoose77 (Collaborator) · 2024-12-13T20:54:28Z

@pfackeldey not much time to answer in full here, but the different column projection conventions were intentional -- they reflect the different concepts of "column" between uproot, parquet, and form remapping!

I would suggest not trying to remove that separation; it is a problem with the one-pass PR that tried to do so.

Ultimately, column optimisation is really "Buffer Optimisation", and is a black box for each array source.

Will try to get to this.

pfackeldey (Maintainer, Author) · 2024-12-13T21:02:47Z

Thank you @agoose77 for your reply!
Ok, I understand that the concept of a "column" is different for uproot, parquet, etc., which is a good reason for the API differences.

I'd argue though that:

1. my example written in this issue should not fail for uproot
2. there should be a way to do the column projection with a list of strings (list of columns) for any kind of projectable IO layer (any format-specific complexity can be hidden inside the io_func/io_layer)
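
One possible shape for the second point, sketched with stand-in classes: a small helper that accepts plain column names and hides which method the IO function exposes. `project_io`, `ToyParquetIO` and `ToyRootIO` are hypothetical; only the method names `project_keys` and `project_columns` come from this thread.

```python
from typing import Any, Iterable


def project_io(io_func: Any, columns: Iterable[str]) -> Any:
    """Project an IO function onto `columns`, whichever method it offers."""
    wanted = frozenset(columns)
    for method_name in ("project_columns", "project_keys"):
        method = getattr(io_func, method_name, None)
        if callable(method):
            return method(wanted)
    raise TypeError(f"{type(io_func).__name__} is not projectable")


class ToyParquetIO:
    """Stand-in for an IO function exposing project_columns()."""

    def __init__(self, columns=frozenset()):
        self.columns = columns

    def project_columns(self, columns):
        return ToyParquetIO(columns)


class ToyRootIO:
    """Stand-in for an IO function exposing project_keys()."""

    def __init__(self, keys=frozenset()):
        self.keys = keys

    def project_keys(self, keys):
        return ToyRootIO(keys)


parquet_projected = project_io(ToyParquetIO(), ["nJet"])
root_projected = project_io(ToyRootIO(), ["nJet"])
```

This sketch deliberately ignores the naming-convention differences raised earlier in the thread, so a real helper would also need a per-source translation of column names.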

agoose77 (Collaborator) · 2024-12-13T21:47:26Z

Apologies for terse replies: I'm in a meeting!

(1) -- on the face of it, the Parquet example surprises me -- it's actually changing the type -- it should fail for uproot because the repr is trying to view missing arrays. n.b., we talked about making placeholder reprs not throw errors.
(2) -- I think there's a naming problem -- the superset of optimisations is "buffer optimisation", so that should be the core API. If we want to support some notion of "column optimisation", it would need to sit on that.
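
To illustrate the layering in (2) with a toy model: a jagged column is typically backed by more than one buffer (for example offsets plus flat data), so a column-level request can be expressed as a request for a set of buffers. The buffer names and the `COLUMN_TO_BUFFERS` table are made up for this sketch and are not what uproot or dask-awkward actually use.

```python
from typing import Dict, FrozenSet, Iterable, Set

# Hypothetical table: which buffers back each user-visible column.
COLUMN_TO_BUFFERS: Dict[str, FrozenSet[str]] = {
    "nJet": frozenset({"nJet-data"}),
    # A jagged column needs its offsets as well as its flat values.
    "Jet_pt": frozenset({"Jet-offsets", "Jet_pt-data"}),
    "Jet_eta": frozenset({"Jet-offsets", "Jet_eta-data"}),
}


def necessary_buffers(columns: Iterable[str]) -> FrozenSet[str]:
    """Column optimisation expressed on top of buffer optimisation."""
    needed: Set[str] = set()
    for column in columns:
        needed |= COLUMN_TO_BUFFERS[column]
    return frozenset(needed)


flat_request = necessary_buffers(["nJet"])
jagged_request = necessary_buffers(["Jet_pt", "Jet_eta"])
```

Two jagged columns that share an offsets buffer request it only once, which is one reason the buffer level is the more general place to optimise.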

pfackeldey (Maintainer, Author) · 2025-01-14T15:14:10Z

I realized the repr failure has been fixed in scikit-hep/awkward#3342.

Rerunning the uproot.dask example of this issue yields:

import uproot

io = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x1670b73d0>
#  0. from-uproot-138b384738005b2a7a7eefbb600ca6c2

# update the IO layer in-place, i.e. do column projection
for layer in io.dask.layers.values():
    layer.io_func = layer.io_func.project_keys(frozenset(["nJet"]))

# now compute, should only load `nJet`
array = io.compute()
print(array)
# <Array [{run: ??, ...}, ..., {run: ??, ...}] type='40 * {run: uint32, lumin...'>

print(array.nJet)
# <Array [5, 8, 5, 3, 5, 8, 4, 4, ..., 4, 9, 3, 2, 3, 1, 6, 2] type='40 * uint32'>

So this now works nicely: all fields are placeholder arrays except for the one that I asked to be loaded. Great!

That means that dak.from_parquet and uproot.dask now behave the same way in terms of data loading. The difference is in the resulting array: for parquet, the form is pruned down to only the IO-source column that was requested, whereas for uproot the form is not pruned and every "leaf" in the layout that does not correspond to "nJet" is a PlaceholderArray.
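
The two outcomes can be mimicked with plain dictionaries. This is only a toy model: `Placeholder`, `read_pruned`, `read_with_placeholders` and the `SOURCE` values are hypothetical, and real Awkward forms and PlaceholderArray objects are richer than this.

```python
from typing import Dict, FrozenSet, List

# Toy "file": column name -> values for three events (made-up numbers).
SOURCE: Dict[str, List[int]] = {
    "run": [1, 1, 1],
    "luminosityBlock": [10, 10, 11],
    "nJet": [5, 8, 5],
}


class Placeholder:
    """Marks a column that is part of the form but was never read."""

    def __repr__(self) -> str:
        return "??"


def read_pruned(columns: FrozenSet[str]) -> Dict[str, object]:
    # Parquet-like outcome in this thread: unrequested fields disappear.
    return {name: SOURCE[name] for name in SOURCE if name in columns}


def read_with_placeholders(columns: FrozenSet[str]) -> Dict[str, object]:
    # uproot-like outcome in this thread: every field stays, and unread
    # ones are replaced by a placeholder.
    return {
        name: SOURCE[name] if name in columns else Placeholder()
        for name in SOURCE
    }


pruned = read_pruned(frozenset({"nJet"}))
stable = read_with_placeholders(frozenset({"nJet"}))
```

Both results hold the same loaded data for `nJet`; they differ only in whether the unread fields are still listed.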

This now opens up the possibility of adding a function that accepts a set of IO-source columns to do the column projection "manually", which is what I was looking for at the very beginning.

Apart from that, the difference between the resulting forms of dak.from_parquet and uproot.dask (pruned vs. placeholder arrays) is not yet fully clear to me. Could someone elaborate on why, for example, this pruning is necessary for parquet? Maybe @agoose77, @martindurant or @jpivarski?

(This issue can be closed now, I'll leave it open in case someone would like to comment on the above mentioned difference. Otherwise feel free to close it!)

agoose77 (Collaborator) · 2025-01-14T15:27:36Z

@pfackeldey the need for unproject_layout is that the interface exposed by the arrow libraries only permits us to drop columns (and not read them). Then we end up with a new form from arrow. We implemented unproject_layout before placeholders existed IIRC, but ultimately it's just figuring out how to coerce the read data to a given (expected) form.

Meanwhile, for uproot our reading is "stable" such that partial reads don't change the underlying form.
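
As a rough illustration of the unprojection step described above, the sketch below coerces a pruned read back to an expected list of fields. It is not the real `unproject_layout`; `unproject`, `MISSING` and the dictionary representation are assumptions made for this example.

```python
from typing import Dict, List, Sequence

MISSING = object()  # hypothetical stand-in for a placeholder buffer


def unproject(
    expected_fields: Sequence[str], read: Dict[str, List[int]]
) -> Dict[str, object]:
    """Re-insert unread fields so the result matches the expected form."""
    unexpected = set(read) - set(expected_fields)
    if unexpected:
        raise ValueError(f"fields not in expected form: {sorted(unexpected)}")
    # Keep the expected field order; fill gaps with the MISSING marker.
    return {name: read.get(name, MISSING) for name in expected_fields}


restored = unproject(["run", "luminosityBlock", "nJet"], {"nJet": [5, 8, 5]})
```

A source whose partial reads already keep the full form, as described for uproot, would not need this extra step.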

~~I think a point of confusion here is that project_columns is not the same thing as the column optimisation — it naively reads only the required columns and doesn't do any unprojection, IIRC.~~

1 reply

agoose77 (Collaborator) · Jan 14, 2025

I had a brief look at this, and realised that project_columns does do unprojection. So it's on a similar level to project_keys, but scoped for columnar sources where "column" is the meaningful unit of data.

It looks like the confusion here stems from the way in which the two mechanisms are being tested:

- compute() also runs column optimisation. Right now, this is not idempotent for Parquet, which should be fixed, and the test should use compute(optimize_graph=False).
- The IO function in the subgraph that is being computed is not the same object as the one in io.dask.layers.

ianna (Maintainer) · 2025-01-14T15:42:54Z

@pfackeldey - I wonder if it's better to convert/move the issue to a discussion?
