Unclear how to manually do column projection with uproot.dask (and API differences with dask-awkward)
#1361
Replies: 9 comments 1 reply
-
(pinging @lgray @agoose77 @kkothari2001 for ideas, input and help 🙏)
-
There are different conventions for how the columns are named, and uproot encodes extra things in these names (because some columns are always required even when not in the output). The correct place to call project is probably the layer rather than the IO function, since the layer provides a place to translate the "awkward" column names into the "io convention" names. (This is something that the one-pass PR explicitly worked around, removing a lot of protocol classes and code in the process.)
-
Thanks for the clarification.
-
@pfackeldey not much time to answer in full here, but the different column projection conventions were intentional -- they reflect the different concepts of "column" between uproot, parquet, and form remapping! I would suggest not trying to remove that separation; it is a problem with the one-pass PR that tried to do so. Ultimately, column optimisation is really "Buffer Optimisation", and is a black box for each array source. Will try to get to this.
-
Thank you @agoose77 for your reply! I'd argue though that:
-
Apologies for terse replies: I'm in a meeting! (1) -- on the face of it, the Parquet example surprises me -- it's actually changing the type -- it should fail for
-
I realized the repr failure has been fixed in scikit-hep/awkward#3342. Rerunning the example:

import uproot
io = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x1670b73d0>
# 0. from-uproot-138b384738005b2a7a7eefbb600ca6c2
# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
    lay.io_func = lay.io_func.project_keys(frozenset(["nJet"]))
# now compute, should only load `nJet`
array = io.compute()
print(array)
# <Array [{run: ??, ...}, ..., {run: ??, ...}] type='40 * {run: uint32, lumin...'>
print(array.nJet)
# <Array [5, 8, 5, 3, 5, 8, 4, 4, ..., 4, 9, 3, 2, 3, 1, 6, 2] type='40 * uint32'>

So this works now nicely, where all fields are placeholder arrays, except for the one that I asked to be loaded, great! That means that This opens up now the possibility to add a function that accepts a set of IO-source columns to do the column projection "manually", which is what I was looking for in the very beginning. Apart from that, the difference the resulting form of

(This issue can be closed now, I'll leave it open in case someone would like to comment on the above mentioned difference. Otherwise feel free to close it!)
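The in-place loop from the snippet above could be wrapped in a small helper. What follows is only a sketch under stated assumptions: the name `project_io_columns` is hypothetical and not part of uproot or dask-awkward, it relies solely on the `io_func.project_keys(...)` call used in this comment, and `FakeIOFunc` / `FakeLayer` are stand-ins so that the snippet runs without a ROOT file.

```python
from dataclasses import dataclass


def project_io_columns(layers, columns):
    """Replace each layer's io_func with one projected to `columns`.

    `layers` is a mapping such as `collection.dask.layers`. Layers whose
    io_func is missing or has no `project_keys` method are left untouched.
    Returns the names of the layers that were projected.
    """
    keys = frozenset(columns)
    projected = []
    for name, layer in layers.items():
        io_func = getattr(layer, "io_func", None)
        if io_func is None or not hasattr(io_func, "project_keys"):
            continue
        layer.io_func = io_func.project_keys(keys)
        projected.append(name)
    return projected


# Hypothetical stand-ins (NOT uproot classes), only here to run the sketch.
@dataclass(frozen=True)
class FakeIOFunc:
    keys: frozenset

    def project_keys(self, keys):
        # Assumption for the stand-in: keep only the requested keys.
        return FakeIOFunc(self.keys & keys)


@dataclass
class FakeLayer:
    io_func: object = None


layers = {
    "from-uproot": FakeLayer(FakeIOFunc(frozenset({"run", "nJet", "Jet_pt"}))),
    "other": FakeLayer(),  # no io_func, must be skipped
}
projected = project_io_columns(layers, ["nJet"])
print(projected)
print(sorted(layers["from-uproot"].io_func.keys))
```

With a real collection the call would presumably be `project_io_columns(io.dask.layers, ["nJet"])` before `io.compute()`, mirroring the loop above. The column names have to follow the IO-source convention, since the helper does no translation.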
-
@pfackeldey the need for Meanwhile, for uproot our reading is "stable" such that partial reads don't change the underlying form.
-
@pfackeldey - I wonder if it's better to convert/move the issue to a discussion?
-
I'm currently looking into adjusting the dask graph layer for the IO to only read a given list of provided columns.
With uproot.dask this looks as follows:
(I have the impression that the underlying form is not updated accordingly here, or am I using the projection interface incorrectly?)
If I do this with parquet instead though, it works:
I don't understand why the above code example works for dak.from_parquet, but not for uproot.dask; there seems to be a real difference in how the column projection is implemented for the io_func of the dask layer.

Apart from that, the APIs are very similar but also a bit misaligned between uproot and dask-awkward (probably for historical reasons), e.g.:

- .project_keys() vs .project_columns()
- form_with_unique_keys argument: '<root>' vs '@'
- the state that holds the information of the trace is constructed differently: https://github.com/scikit-hep/uproot5/blob/main/src/uproot/_dask.py#L1082-L1084 vs https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py#L104

There are probably some more that I've not yet encountered.

In principle, it would be nice if uproot.dask would adhere to the protocols defined here: https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py, to eliminate these differences. Some of this seems to be duplicated code in uproot._dask as well.

I'm currently trying to find a way to unify the APIs and to find the reason for this difference.
I'd appreciate any input on how this should work/behave and how we can ensure that the APIs won't diverge in the future.

(If this API were unified, it would be rather easy to make dak.project_columns possible for all AwkwardInputLayer kinds.)
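Until the APIs are aligned, the method-name difference (.project_keys() vs .project_columns()) could be bridged by a thin adapter. This is a hedged sketch, not an existing API: `project_any` is a hypothetical name, only the two method names come from this thread, passing a frozenset to both methods is an assumption, and the stand-in classes exist solely to make the snippet runnable. It deliberately does not translate between the column-name conventions of the two libraries.

```python
def project_any(io_func, columns):
    """Call whichever projection method `io_func` provides.

    Tries `project_keys` first, then `project_columns`.
    No column-name translation is attempted.
    """
    columns = frozenset(columns)
    for method_name in ("project_keys", "project_columns"):
        method = getattr(io_func, method_name, None)
        if callable(method):
            return method(columns)
    raise TypeError(
        f"{type(io_func).__name__} offers no known projection method"
    )


# Hypothetical stand-ins, only here to demonstrate the dispatch.
class UprootLike:
    def project_keys(self, keys):
        return ("project_keys", sorted(keys))


class ParquetLike:
    def project_columns(self, columns):
        return ("project_columns", sorted(columns))


print(project_any(UprootLike(), ["nJet"]))
print(project_any(ParquetLike(), ["nJet"]))
```

Such an adapter only hides the naming difference. The differing column-name conventions and the differently constructed trace state would still need to be handled per array source.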