Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hive style partitions #20

Merged
merged 6 commits into from
Jun 28, 2024
Merged

Hive style partitions #20

merged 6 commits into from
Jun 28, 2024

Conversation

maximedion2
Copy link
Collaborator

No description provided.

@maximedion2 maximedion2 requested a review from tshauck June 25, 2024 02:28
@maximedion2
Copy link
Collaborator Author

This mostly follows the parquet implementation, minus schema merging and with a handful of small differences to account for the way zarr data is stored.

@maximedion2 maximedion2 linked an issue Jun 25, 2024 that may be closed by this pull request
Copy link
Collaborator

@tshauck tshauck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, nice work! It's unfortunate so much stuff is copied from datafusion. I almost wonder opening a ticket / PR that just makes some of this public would be good (though not sure exactly what you changed).

My only big comment is it'd be nice to make sure the tests can be run on a fresh checkout vs having some of your file paths hardcoded.

("other_var".to_string(), DataType::Utf8),
];

let prefix = "home/max/Documents/repos/arrow-zarr/test-data/data/zarr/v2_data/lat_lon_w_groups_example.zarr";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to make this relative to the package.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah I didn't mean to do that, good catch. I got sloppy and forgot about that line I probably added for some intermediate test I ran.

@@ -102,8 +102,18 @@ impl ExecutionPlan for ZarrScan {
.runtime_env()
.object_store(&self.base_config.object_store_url)?;

let config =
ZarrConfig::new(object_store).with_projection(self.base_config.projection.clone());
// This is just replicating the `file_column_projection_indices` method on
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let test_data = get_test_v2_data_path("lat_lon_w_groups_example.zarr".to_string());

let sql = format!(
"CREATE EXTERNAL TABLE zarr_table (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but adding slt tests might be nice at some point. DataFusion uses to remove some of this SQL.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm what's an slt test here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, should've included a link... https://github.com/risinglightdb/sqllogictest-rs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh interesting... yeah probably worth looking into at some point.

pub fn new(config: ListingZarrTableConfig, table_schema: Schema) -> Self {
Self {
pub fn try_new(config: ListingZarrTableConfig) -> DataFusionResult<Self> {
// TODO does options need to be an Option?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my fault, and I think you're probably right here. Somewhat recently I refactored code I had that followed this pattern to be simpler and not make options an Option.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right okay, I think I made that comment because I didn't see a use case for setting it to None. I thought maybe there was a reason related to other datafusion features I didn't know about, but if it's not necessary I'll just remove it.

} else {
TableProviderFilterPushDown::Unsupported
TableProviderFilterPushDown::Inexact
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this change is that everything gets pushed down, but only file schema fields are inexact?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, so if all the fields in a particular filter expression are partitions, then the filter can be applied exactly (i.e. datafusion won't bother applying that filter to what your reader/stream returns after the pushdown, if I understand correctly).
For fields that are not partitions, so fields that are in the file schema as you said, the zarr reader will try to skip reading whole chunk files if possible, to avoid the I/O, but that's it, it makes no guarantee that the result of that will satisfy the filter. So, again from my understanding, that's what Inexact means here, the filter will be pushed down to the reader, but datafusion will still apply it to what the reader returns.

// directories under the last partition, those will all be zarr arrays,
// and the last partition itself will be a store to be read atomically.
if depth < max_depth {
match futures.len() < CONCURRENCY_LIMIT {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, but perhaps make this an if since it's just true or false?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. yeah I copied this form datafusion, but I remember thinking that it was a bit much, considering that a good old fashioned if/else would do the trick.

@maximedion2
Copy link
Collaborator Author

Generally LGTM, nice work! It's unfortunate so much stuff is copied from datafusion. I almost wonder opening a ticket / PR that just makes some of this public would be good (though not sure exactly what you changed).

My only big comment is it'd be nice to make sure the tests can be run on a fresh checkout vs having some of your file paths hardcoded.

Yeah, at some point, when I realized that I had to copy yet another thing, I considered just making a ticket to ask for some stuff to be made public. I'm not sure what the turn around time is though, when asking for something like that, I think there are a lot of open tickets for datafusion, not sure when they would get to a request like this. and yes, I did have to make some minor modifications to some functions too. In the end I decided against it, but I thought maybe I could come back to this in a while, when I can make a definitive (ish) list of functions and structs I would need to use (instead of e.g. making a ticket and then realizing I need even more functions and having to make another one).

@maximedion2 maximedion2 merged commit dbb3e42 into main Jun 28, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support hive style listing / partitioning
2 participants