-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hive style partitions #20
Conversation
This mostly follows the parquet implementation, minus schema merging and with a handful of small differences to account for the way zarr data is stored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, nice work! It's unfortunate so much stuff is copied from datafusion. I almost wonder opening a ticket / PR that just makes some of this public would be good (though not sure exactly what you changed).
My only big comment is it'd be nice to make sure the tests can be run on a fresh checkout vs having some of your file paths hardcoded.
src/datafusion/helpers.rs
Outdated
("other_var".to_string(), DataType::Utf8), | ||
]; | ||
|
||
let prefix = "home/max/Documents/repos/arrow-zarr/test-data/data/zarr/v2_data/lat_lon_w_groups_example.zarr"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to make this relative to the package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yeah I didn't mean to do that, good catch. I got sloppy and forgot about that line I probably added for some intermediate test I ran.
@@ -102,8 +102,18 @@ impl ExecutionPlan for ZarrScan { | |||
.runtime_env() | |||
.object_store(&self.base_config.object_store_url)?; | |||
|
|||
let config = | |||
ZarrConfig::new(object_store).with_projection(self.base_config.projection.clone()); | |||
// This is just replicating the `file_column_projection_indices` method on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
let test_data = get_test_v2_data_path("lat_lon_w_groups_example.zarr".to_string()); | ||
|
||
let sql = format!( | ||
"CREATE EXTERNAL TABLE zarr_table ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but adding slt tests might be nice at some point. DataFusion uses to remove some of this SQL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm what's an slt test here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, should've included a link... https://github.com/risinglightdb/sqllogictest-rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh interesting... yeah probably worth looking into at some point.
src/datafusion/table_provider.rs
Outdated
pub fn new(config: ListingZarrTableConfig, table_schema: Schema) -> Self { | ||
Self { | ||
pub fn try_new(config: ListingZarrTableConfig) -> DataFusionResult<Self> { | ||
// TODO does options need to be an Option? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my fault, and I think you're probably right here. Somewhat recently I refactored code I had that followed this pattern to be simpler and not make options an Option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right okay, I think I made that comment because I didn't see a use case for setting it to None. I thought maybe there was a reason related to other datafusion features I didn't know about, but if it's not necessary I'll just remove it.
} else { | ||
TableProviderFilterPushDown::Unsupported | ||
TableProviderFilterPushDown::Inexact |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this change is that everything gets pushed down, but only file schema fields are inexact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, so if all the fields in a particular filter expression are partitions, then the filter can be applied exactly (i.e. datafusion won't bother applying that filter to what your reader/stream returns after the pushdown, if I understand correctly).
For fields that are not partitions, so fields that are in the file schema as you said, the zarr reader will try to skip reading whole chunk files if possible, to avoid the I/O, but that's it, it makes no guarantee that the result of that will satisfy the filter. So, again from my understanding, that's what Inexact means here, the filter will be pushed down to the reader, but datafusion will still apply it to what the reader returns.
src/datafusion/helpers.rs
Outdated
// directories under the last partition, those will all be zarr arrays, | ||
// and the last partition itself will be a store to be read atomically. | ||
if depth < max_depth { | ||
match futures.len() < CONCURRENCY_LIMIT { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, but perhaps make this an if
since it's just true or false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. yeah I copied this form datafusion, but I remember thinking that it was a bit much, considering that a good old fashioned if/else would do the trick.
Yeah, at some point, when I realized that I had to copy yet another thing, I considered just making a ticket to ask for some stuff to be made public. I'm not sure what the turn around time is though, when asking for something like that, I think there are a lot of open tickets for datafusion, not sure when they would get to a request like this. and yes, I did have to make some minor modifications to some functions too. In the end I decided against it, but I thought maybe I could come back to this in a while, when I can make a definitive (ish) list of functions and structs I would need to use (instead of e.g. making a ticket and then realizing I need even more functions and having to make another one). |
No description provided.