-
Notifications
You must be signed in to change notification settings - Fork 828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements to UTF-8 statistics truncation #6870
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @etseidl -- the tests in this PR are amazing
I am not sure about the logic to increment the utf8 bytes. Let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl -- I went through this logic quite carefully and I think it looks great.
- It only copies the string values once (like the current code)
- I am convinced it is correct ❤️ (hopefully those will not be famous last words)
So all in all, thank you and great job. Thank you
Err(_) => Some(data[..l].to_vec()), | ||
}) | ||
.and_then(|l| | ||
// don't do extra work if this column isn't UTF-8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
if self.is_utf8() { | ||
match str::from_utf8(data) { | ||
Ok(str_data) => truncate_utf8(str_data, l), | ||
Err(_) => Some(data[..l].to_vec()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is a somewhat questionable move to truncate this on invalid data, but I see that is wht the code used to do so seems good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point. The old code simply tried utf first, and then fell back. Here we're actually expecting valid UTF8 so perhaps it's better to return an error. I'd hope some string validation was done before getting this far. I'll think on this some more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should leave it as is and maybe document that if non utf8 data is passed in it will be truncated with bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've managed a test that exercises this path via the SerializedFileWriter
API. It truncates as expected (i.e. as binary, not UTF-8), but now I worry that it's possible to create invalid data. Oddly both parquet-java and pyarrow seem ok with non-utf8 string data.
To paraphrase a wise man I know: Every day I wake up. And then I remember Parquet exists. 🫤
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left the logic here as is, but added documentation and a test. We can revisit if this ever becomes an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To paraphrase a wise man I know: Every day I wake up. And then I remember Parquet exists. 🫤
I solace myself with this quote from a former coworker:
"Legacy Code, n: code that is getting the job done, and pretty well at that"
Not that we can't / shouldn't improve it of course 🤣
thanks again for all the help here
/// Returns `true` if this column's logical type is a UTF-8 string. | ||
fn is_utf8(&self) -> bool { | ||
self.get_descriptor().logical_type() == Some(LogicalType::String) | ||
|| self.get_descriptor().converted_type() == ConvertedType::UTF8 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this works for dictionary encoded columns as well right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, regardless of the encoding, the statistics are for the data itself. You wouldn't see a dictionary key here.
* fix a few edge cases with utf-8 incrementing * add todo * simplify truncation * add another test * note case where string should render right to left * rework entirely, also avoid UTF8 processing if not required by the schema * more consistent naming * modify some tests to truncate in the middle of a multibyte char * add test and docstring * document truncate_min_value too
Which issue does this PR close?
Closes #6867.
Rationale for this change
See issue.
What changes are included in this PR?
For max statistics replaces
truncate_utf8().and_then(increment_utf8)
with a new functiontruncate_and_increment_utf8()
. This defers the creation of a newVec<u8>
until all processing is complete. This also changesincrement_utf8
to operate on entire unicode code points rather than doing arithmetic on the individual UTF-8 encoded bytes. Finally, this modifies the truncation logic so that UTF-8 handling is only done for columns whose logical type isString
(or converted typeUTF8
).The new increment logic is up to 30X faster for pathological cases of strings that cannot be truncated, and is no slower than the current code for simple cases where only the last byte of a string needs to be incremented.
Are there any user-facing changes?
No API changes, but will potentially produce different truncated max statistics.