-
Dear experts,

(Apologies if this is a duplicate question/issue. I looked around but found nothing, so I decided to open a new discussion.)

We are developing a new setup for analyses in our group which relies heavily on awkward and its nice features! My question concerns awkward's read/write capabilities for Parquet. We'd like to load only specific columns from nested parquet file structures that were created using

While poking around, I found this issue from last October where Jim already tried exactly the syntax from above and noticed that the

Now here is my question: Are there ongoing efforts to include loading specific columns from nested parquet files? Or is the plan for
Replies: 3 comments
-
I'll wait for another developer who's worked on the V2 Parquet interface to give a more concrete answer, but from glancing at the source, I'd wager that we will support nested fields in the manner that you describe.

As to the PyArrow bug, assuming that it hasn't been fixed, that would be a show-stopper. However, it looks like we have a work-around already: a719f8c, which I assume is covered by this test: https://github.com/scikit-hep/awkward/blob/a719f8ccac3af84e436ecc366cc0690e53bda6fa/tests/v2/test_0593-preserve-nullability-in-arrow-and-parquet.py

I haven't tested any of this, but you can! If you install the latest RC, we have the V2 API under
-
You're right, @agoose77, that the version 2 implementation has a lot more options for limiting the read. That was motivated by the fact that v1 would often be used with lazy arrays, but now we're separating the laziness from Awkward into dask-awkward, so the non-Dask Awkward function will need to be more tuneable.

Here's how to do a projected read. I'm importing Awkward as

>>> import awkward as ak

so that all of my "._v2"s are explicit, but you could import awkward._v2 as ak.

In order to know what columns there are, to know what to ask for, we can get a metadata object:

>>> filename = "https://pivarski-princeton.s3.amazonaws.com/chicago-taxi.parquet"
>>> metadata = ak._v2.metadata_from_parquet(filename)

This has a number of things, including Parquet's own metadata, with useful fields, such as

>>> metadata.form.type.show()
var * {
trip: {
sec: ?float32,
km: ?float32,
begin: {
lon: ?float64,
lat: ?float64,
time: ?datetime64[ms]
},
end: {
lon: ?float64,
lat: ?float64,
time: ?datetime64[ms]
},
path: var * {
londiff: float32,
latdiff: float32
}
},
payment: {
fare: ?float32,
tips: ?float32,
total: ?float32,
type: var * char
},
company: var * char
}

This is a deeply nested view of the type. Parquet likes to think of data as being members of columns, which are the leaves of this tree. They have names with dots:

>>> metadata.form.columns()
['trip.sec',
'trip.km',
'trip.begin.lon',
'trip.begin.lat',
'trip.begin.time',
'trip.end.lon',
'trip.end.lat',
'trip.end.time',
'trip.path.londiff',
'trip.path.latdiff',
'payment.fare',
'payment.tips',
'payment.total',
'payment.type',
'company']

You can select a few of these (a single string or a list of strings, possibly with wildcards) from the Form, which projects the type to have only the columns you pick:

>>> metadata.form.select_columns(["trip.path.*diff", "payment.type"]).type.show()
var * {
trip: {
path: var * {
londiff: float32,
latdiff: float32
}
},
payment: {
type: var * char
}
}

Passing this to the

>>> array = ak._v2.from_parquet(filename, columns=["trip.path.*diff", "payment.type"], row_groups=[0])

The array we pulled has only the columns we asked for.

>>> array.show()
[[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
...,
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: []}, payment: {type: ..., ...}}, ..., {trip: {...}, ...}],
[{trip: {path: []}, payment: {type: ..., ...}}, ..., {trip: {...}, ...}],
[{trip: {path: []}, payment: {type: ..., ...}}, ..., {trip: {...}, ...}],
[{trip: {path: []}, payment: {type: ..., ...}}, ..., {trip: {...}, ...}],
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: []}, payment: {type: 'Cash'}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}]]
>>> array.type.show()
353 * var * ?{
trip: {
path: var * {
londiff: float32,
latdiff: float32
}
},
payment: {
type: string
}
}

I think the effect of ARROW-14485 is to make the taxi records nullable, when they weren't nullable in the original file. I also think there's something wrong with the metadata form presenting the

But these issues won't stop data analysis:

>>> array.trip.path.londiff * 100
<Array [[[-0.00241, -0.00241], ...], ...] type='353 * var * option[var * fl...'>

(See upcoming tutorial for more on this file.)
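To make the column naming and wildcard selection above more concrete, here is a minimal pure-Python sketch. It is not Awkward's implementation: `taxi_type`, `leaf_columns`, and the standalone `select_columns` below are hypothetical names invented for illustration, and Python's `fnmatch` stands in as an assumption for whatever wildcard matching the library actually uses. It only shows how the leaves of a nested record map to dotted names and how a pattern such as `"trip.path.*diff"` narrows them down.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for the nested record type shown above:
# dicts are records, strings are leaf types. List nesting ("var *")
# is left out because it does not change the dotted names.
taxi_type = {
    "trip": {
        "sec": "?float32",
        "km": "?float32",
        "begin": {"lon": "?float64", "lat": "?float64", "time": "?datetime64[ms]"},
        "end": {"lon": "?float64", "lat": "?float64", "time": "?datetime64[ms]"},
        "path": {"londiff": "float32", "latdiff": "float32"},
    },
    "payment": {
        "fare": "?float32",
        "tips": "?float32",
        "total": "?float32",
        "type": "var * char",
    },
    "company": "var * char",
}


def leaf_columns(node, prefix=""):
    """Return the dotted names of all leaves, in declaration order."""
    names = []
    for field, child in node.items():
        path = f"{prefix}.{field}" if prefix else field
        if isinstance(child, dict):
            # A nested record: recurse and keep extending the dotted path.
            names.extend(leaf_columns(child, path))
        else:
            # A leaf: this is what Parquet would store as a column.
            names.append(path)
    return names


def select_columns(node, patterns):
    """Keep only the leaves whose dotted name matches any glob pattern."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return [
        name
        for name in leaf_columns(node)
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


columns = leaf_columns(taxi_type)
selected = select_columns(taxi_type, ["trip.path.*diff", "payment.type"])
print(selected)
```

One caveat of this stand-in: `fnmatch`'s `*` also matches dots, so a pattern like `"trip.*"` would select every leaf under `trip` at any depth. Whether the real `select_columns` behaves the same way is not something this sketch establishes.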
-
Sorry for the late reply! Thanks for the explanation and for pointing me to