Load specific columns from parquet files #1522

pkausw · 2022-06-30T13:39:17Z

pkausw
Jun 30, 2022

Dear experts,

(apologies if this is a duplicate question/issue -- I looked around but found nothing, so I decided to open a new discussion)

We are developing a new setup for analyses in our group which heavily relies on awkward and its nice features! My question particularly revolves around the read/write capabilities of awkward to parquet. We'd like to load only specific columns from nested parquet file structures that were created using awkward.to_parquet. Right now, the awkward.from_parquet function only supports names of "top-level" fields (such as Jet) but not something more specific, e.g. Jet.list.item.pt, as the parquet format itself supports.

While poking around, I found this issue from last October where Jim already tried exactly the syntax from above and noticed that the nullable argument in the underlying pyarrow struct is not correctly preserved by the pyarrow.parquet.ParquetFile.read_row_group function.

Now here is my question: Are there ongoing efforts to include loading specific columns from nested parquet files? Or is the plan for awkward to wait for a fix of this issue on the pyarrow side? My naive assumption would be that the loss of the nullable information doesn't break anything, right? Just trying it out myself, the resulting awkward array is usable the same way, though the underlying layout changed of course. (Again, this is very naive, no idea about the effects this might have downstream)

Answered by jpivarski


agoose77 · 2022-06-30T13:50:23Z

agoose77
Jun 30, 2022
Maintainer

I'll wait for another developer who's worked on the V2 Parquet interface to give a more concrete answer, but from glancing at the source, I'd wager that we will support nested fields in the manner that you describe.

As to the PyArrow bug, assuming that it hasn't been fixed, that would be a show-stopper. However, it looks like we have a work-around already: a719f8c, which I assume is covered by this test: https://github.com/scikit-hep/awkward/blob/a719f8ccac3af84e436ecc366cc0690e53bda6fa/tests/v2/test_0593-preserve-nullability-in-arrow-and-parquet.py

I haven't tested any of this, but you can! If you install the latest RC, we have the V2 API under awkward._v2. If we've implemented this support, you could try it out.


jpivarski · 2022-06-30T15:23:45Z

jpivarski
Jun 30, 2022
Maintainer

You're right, @agoose77, that the version 2 implementation has a lot more options for limiting the read. That was motivated by the fact that v1 would often be used with lazy arrays, but now we're separating the laziness from Awkward into dask-awkward, so the non-Dask Awkward function will need to be more tunable.

Here's how to do a projected read. I'm importing Awkward as

>>> import awkward as ak

so that all of my "._v2"s are explicit, but you could import awkward._v2 as ak.

In order to know what columns there are, to know what to ask for, we can get a metadata object:

>>> filename = "https://pivarski-princeton.s3.amazonaws.com/chicago-taxi.parquet"
>>> metadata = ak._v2.metadata_from_parquet(filename)

This has a number of things, including Parquet's own metadata, with useful fields, such as metadata.metadata.num_rows (number of entries) and metadata.metadata.num_row_groups (number of individually readable chunks). But for now, let's look at the type, which is a property of metadata.form.

>>> metadata.form.type.show()
var * {
    trip: {
        sec: ?float32,
        km: ?float32,
        begin: {
            lon: ?float64,
            lat: ?float64,
            time: ?datetime64[ms]
        },
        end: {
            lon: ?float64,
            lat: ?float64,
            time: ?datetime64[ms]
        },
        path: var * {
            londiff: float32,
            latdiff: float32
        }
    },
    payment: {
        fare: ?float32,
        tips: ?float32,
        total: ?float32,
        type: var * char
    },
    company: var * char
}

This is a deeply nested view of the type. Parquet likes to think of data as being members of columns, which are the leaves of this tree. They have names with dots:

>>> metadata.form.columns()
['trip.sec',
 'trip.km',
 'trip.begin.lon',
 'trip.begin.lat',
 'trip.begin.time',
 'trip.end.lon',
 'trip.end.lat',
 'trip.end.time',
 'trip.path.londiff',
 'trip.path.latdiff',
 'payment.fare',
 'payment.tips',
 'payment.total',
 'payment.type',
 'company']
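
To illustrate where those dotted names come from, here is a minimal sketch in plain Python. It is not Awkward's implementation: the nested dict `schema` and the helper `leaf_columns` are made up for this example, and list levels (the `var *` in the type) are simply folded away, since they don't appear in the names above.

```python
def leaf_columns(schema, prefix=""):
    """Return the dotted names of all leaves in a nested dict (hypothetical helper)."""
    names = []
    for field, subtype in schema.items():
        path = f"{prefix}.{field}" if prefix else field
        if isinstance(subtype, dict):
            # A record: recurse into its fields, extending the dotted path.
            names.extend(leaf_columns(subtype, path))
        else:
            # A leaf: this is what Parquet stores as one column.
            names.append(path)
    return names


# Hand-written stand-in for the taxi type shown above (an assumption, not read from the file).
schema = {
    "trip": {
        "sec": "?float32",
        "km": "?float32",
        "begin": {"lon": "?float64", "lat": "?float64", "time": "?datetime64[ms]"},
        "end": {"lon": "?float64", "lat": "?float64", "time": "?datetime64[ms]"},
        "path": {"londiff": "float32", "latdiff": "float32"},
    },
    "payment": {
        "fare": "?float32",
        "tips": "?float32",
        "total": "?float32",
        "type": "var * char",
    },
    "company": "var * char",
}

for name in leaf_columns(schema):
    print(name)
```

Walking the tree depth-first in field order gives the same fifteen names, in the same order, as the list above.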

You can select a few of these (single string/list of strings/possibly with wildcards) from the Form, which projects the type to have only the columns you pick:

>>> metadata.form.select_columns(["trip.path.*diff", "payment.type"]).type.show()
var * {
    trip: {
        path: var * {
            londiff: float32,
            latdiff: float32
        }
    },
    payment: {
        type: var * char
    }
}
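
If it helps to see the idea behind the wildcard selection, here is a rough sketch using only the standard library's `fnmatch`. This is an assumption about how such matching could work, not the actual code behind `select_columns`: the function below and the rule that a bare prefix like "trip.begin" selects everything underneath it are hypothetical choices made for this example.

```python
from fnmatch import fnmatchcase

# The leaf column names listed earlier for the taxi file.
COLUMNS = [
    "trip.sec", "trip.km",
    "trip.begin.lon", "trip.begin.lat", "trip.begin.time",
    "trip.end.lon", "trip.end.lat", "trip.end.time",
    "trip.path.londiff", "trip.path.latdiff",
    "payment.fare", "payment.tips", "payment.total", "payment.type",
    "company",
]


def select_columns(columns, patterns):
    """Keep the columns matching any pattern (hypothetical sketch, not Awkward's code)."""
    if isinstance(patterns, str):
        patterns = [patterns]  # accept a single string as well as a list
    selected = []
    for col in columns:
        for pat in patterns:
            # Match the full dotted name, or treat the pattern as a prefix
            # that selects every leaf below it.
            if fnmatchcase(col, pat) or fnmatchcase(col, pat + ".*"):
                selected.append(col)
                break
    return selected


print(select_columns(COLUMNS, ["trip.path.*diff", "payment.type"]))
print(select_columns(COLUMNS, "trip.begin"))
```

Note that `fnmatch`'s `*` also matches dots, so a pattern like "trip.*" would pick up every leaf under trip, however deeply nested.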

Passing the same column selection to the from_parquet function reduces the amount of data you need to read. So does limiting the set of row groups. (This is a half-GB file, and we're pulling the largest columns.)

>>> array = ak._v2.from_parquet(filename, columns=["trip.path.*diff", "payment.type"], row_groups=[0])

The array we pulled has only the columns we asked for.

>>> array.show()
[[{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 ...,
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: []}, payment: {type: ..., ...}}, ..., {trip: {...}, ...}],
 [{trip: {path: []}, payment: {type: ..., ...}}, ..., {trip: {...}, ...}],
 [{trip: {path: []}, payment: {type: ..., ...}}, ..., {trip: {...}, ...}],
 [{trip: {path: []}, payment: {type: ..., ...}}, ..., {trip: {...}, ...}],
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: []}, payment: {type: 'Cash'}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}],
 [{trip: {path: [{...}, ...]}, payment: {...}}, {...}, ..., {trip: {...}, ...}]]
>>> array.type.show()
353 * var * ?{
    trip: {
        path: var * {
            londiff: float32,
            latdiff: float32
        }
    },
    payment: {
        type: string
    }
}

I think the effect of ARROW-14485 is to make the taxi records nullable, when they weren't nullable in the original file. I also think there's something wrong with the metadata form presenting the payment.type as var * char, rather than string, as it's supposed to be.

But these issues won't stop data analysis:

>>> array.trip.path.londiff * 100
<Array [[[-0.00241, -0.00241], ...], ...] type='353 * var * option[var * fl...'>

(See upcoming tutorial for more on this file.)


pkausw · 2022-07-06T12:27:57Z

pkausw
Jul 6, 2022
Author

Sorry for the late reply! Thanks for the explanation and pointing me to awkward._v2 and the tutorial! This answers my question -- I think we'll either live with loading 'too many' columns for now or implement something in the current version of awkward until version 2.0 goes live 👍

