-
Notifications
You must be signed in to change notification settings - Fork 841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optionally coerce names of maps and lists to match Parquet specification #6828
Conversation
Note this would supersede #6814. |
parquet/src/file/properties.rs
Outdated
/// Writers have the option to coerce these into native Parquet types. Type | ||
/// coercion allows for meaningful representations that do not require | ||
/// downstream readers to consider the embedded Arrow schema. However, type | ||
/// Also, for [`List`] and [`Map`] types, Parquet expects certain schema elements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should most of this verbiage move to set_coerce_types
on the builder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or set_coerce_types could link to here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or set_coerce_types could link to here
I already made that change. My concern is that most of the deep documentation is currently on the setters rather than the getters...this one is an outlier.
Edit: I went ahead and moved the bulk of the documentation to the builder. I think more eyes will find it there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a test using ArrowWriter to check everything is hooked up correctly might be valuable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// This type coercion allows for meaningful representations that do not require | ||
/// downstream readers to consider the embedded Arrow schema, and can allow for greater | ||
/// compatibility with other Parquet implementations. However, type | ||
/// coercion also prevents the data from being losslessly round-tripped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this wording too
I just restarted the failed jobs and will try and merge this PR in when complete |
Thanks @etseidl |
Thanks @etseidl for your work!!! This is gonna be published in the next minor release or it's considered a major change and will be out in v54? Given the fact it's an optional behaviour (we need to call |
The next release is scheduled to be a major one (54) so I expect this to be out in 54 and thus I haven't thought carefully if it would be classified as major/minor |
Which issue does this PR close?
Closes #6213.
Closes #6733.
Rationale for this change
Allow for forcing use of compliant names for Parquet
LIST
andMAP
fields.What changes are included in this PR?
Uses the
coerce_types
boolean property added in #6313 to control naming of elements ofLIST
andMAP
structures. Also adds acoerce_types
flag toparquet-rewrite
.Are there any user-facing changes?
No API changes, but change to behavior when
coerce_types
istrue
.