Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster parquet utf8 validation using simdutf8 #6668

Merged
merged 3 commits into from
Jan 14, 2025
Merged

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Oct 31, 2024

Which issue does this PR close?

Closes #6667

Rationale for this change

Improves performance for about 4-5% (on M1 Pro) on strings (plain encoding):

arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs
                        time:   [740.81 µs 746.51 µs 752.11 µs]
                        change: [-5.8127% -5.2637% -4.6414%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringArray/plain encoded, optional, no NULLs
                        time:   [743.62 µs 748.70 µs 754.14 µs]
                        change: [-4.2825% -3.6551% -3.0212%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringArray/plain encoded, optional, half NULLs
                        time:   [633.43 µs 638.47 µs 643.71 µs]
                        change: [-5.1930% -4.5414% -3.8189%] (p = 0.00 < 0.05)

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 31, 2024
Ok(_) => Ok(()),
Err(e) => Err(general_err!("encountered non UTF-8 data: {}", e)),
Err(_) => {
let e = simdutf8::compat::from_utf8(val).unwrap_err();
Copy link
Contributor Author

@Dandandan Dandandan Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call compat from_utf8 again to get the same error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the role of simdutf8::basic::from_utf8 and re-run with simdutf8::compat -- does this deserve a code comment?

(at least the .unwrap_err() safety deserves one)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same in offset_buffer.rs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree deserves some comments explaining why we rerun it in case of error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a positive sentiment about using simdutf8 for faster validation, I can do so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have our own from_utf8 that wraps the simdutf8 implementation? Then the weird basic/compat path would only need to be documented once (and make it easier to replace other from_utf8 invocations that @alamb identified).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a PR that does this in the next few days if no one beats me to it

@@ -69,6 +69,7 @@ paste = { version = "1.0" }
half = { version = "2.1", default-features = false, features = ["num-traits"] }
sysinfo = { version = "0.32.0", optional = true, default-features = false, features = ["system"] }
crc32fast = { version = "1.4.2", optional = true, default-features = false }
simdutf8 = { version = "0.1.5"}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be optional as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How mature is the library and its dependencies?
My random spike led me to https://github.com/rusticstuff/simdutf8/blob/main/src/implementation/aarch64/neon.rs#L16 and https://docs.rs/flexpect/latest/flexpect/ lacks documentation.
Should we help simdutf8 to bring it to arrow's maturity level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it is just some macro helper for clippy split off as crate / dependency. Doesn't seem too bad.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: given the number of downloads (28M) and use by widely popular libraries (e.g polars) in mt opinion the simdutf8 crate is sufficiently mature for inclusion. More info in this comment

@tustvold
Copy link
Contributor

I'm not sure that 5% really justifies an additional dependency, especially one that uses so much unsafe...

@Dandandan
Copy link
Contributor Author

I'm not sure that 5% really justifies an additional dependency, especially one that uses so much unsafe...

Hm yeah wondering about that.

I think that 5% speed up for Parquet might be quite valuable though, given that it often translates in close to 5% faster query execution for queries where Parquet scan is a bottleneck (quite some DF benchmarks actually involving string data).

@Dandandan
Copy link
Contributor Author

Dandandan commented Nov 1, 2024

FWIW some other projects are using simdutf8 as well, like polars https://github.com/pola-rs/polars/blob/main/Cargo.toml#L77 and simd-json

@alamb
Copy link
Contributor

alamb commented Nov 7, 2024

I am not sure exactly the usecase here, but what about simply disabling utf8 validation for known good data?

@Dandandan
Copy link
Contributor Author

Dandandan commented Nov 8, 2024

I am not sure exactly the usecase here, but what about simply disabling utf8 validation for known good data?

The "use case" of this PR is just that utf8 validation takes time, this PR improves the performance.

I think having a option to disable it makes sense, but would be good to minimize the cost of validation as well.

@alamb
Copy link
Contributor

alamb commented Dec 17, 2024

So what shall we do with this PR? Make it an optional opt-in feature of the parquet crate that people can enable if they want more performance?

@doki23
Copy link
Contributor

doki23 commented Dec 24, 2024

I believe this PR is solely for performance enhancement. Introducing an optional opt-in feature deserves a separate PR.

@XiangpengHao
Copy link
Contributor

Improves performance for about 4-5% (on M1 Pro) on strings (plain encoding)

Coming from #6921 (comment), I have seen much larger performance (~15%) improvements using simdutf8 with StringViewArray + x86, especially when strings are long (>128 byte).

@alamb
Copy link
Contributor

alamb commented Jan 1, 2025

What I suggest we do with this PR is get some end to end performance numbers (aka run the DataFusion clickbench benchmark with a pinned arrow version with this change)

Assuming that looks promising I recommend creating a PR that has an optional feature (enabled by default) for using simdjson for utf8 validation.

@alamb alamb changed the title Faster utf8 validation Faster parquet utf8 validation using simdjson Jan 1, 2025
@XiangpengHao
Copy link
Contributor

XiangpengHao commented Jan 2, 2025

This is my benchmark results with Clickbench non-partitioned and filter pushdown enabled. Benchmarked on x86 AMD 9900x. Some scan-dominate queries can get 20% improvements.

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ no-simd-utf8 ┃ simd-utf8 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │       0.35ms │    0.36ms │     no change │
│ QQuery 1     │      50.50ms │   50.15ms │     no change │
│ QQuery 2     │      68.05ms │   67.37ms │     no change │
│ QQuery 3     │      89.46ms │   88.81ms │     no change │
│ QQuery 4     │     463.69ms │  459.30ms │     no change │
│ QQuery 5     │     527.10ms │  481.84ms │ +1.09x faster │
│ QQuery 6     │      50.25ms │   50.92ms │     no change │
│ QQuery 7     │      55.59ms │   55.34ms │     no change │
│ QQuery 8     │     562.75ms │  561.54ms │     no change │
│ QQuery 9     │     580.23ms │  580.26ms │     no change │
│ QQuery 10    │     159.95ms │  155.25ms │     no change │
│ QQuery 11    │     172.33ms │  168.08ms │     no change │
│ QQuery 12    │     730.38ms │  653.11ms │ +1.12x faster │
│ QQuery 13    │    1205.87ms │ 1100.43ms │ +1.10x faster │
│ QQuery 14    │     739.05ms │  666.65ms │ +1.11x faster │
│ QQuery 15    │     550.73ms │  552.73ms │     no change │
│ QQuery 16    │    1188.05ms │ 1159.64ms │     no change │
│ QQuery 17    │    1159.66ms │ 1146.29ms │     no change │
│ QQuery 18    │    2641.59ms │ 2652.32ms │     no change │
│ QQuery 19    │      81.54ms │   90.09ms │  1.10x slower │
│ QQuery 20    │     610.90ms │  589.75ms │     no change │
│ QQuery 21    │     705.48ms │  663.11ms │ +1.06x faster │
│ QQuery 22    │    1659.48ms │ 1357.19ms │ +1.22x faster │
│ QQuery 23    │    3534.78ms │ 3639.67ms │     no change │
│ QQuery 24    │     533.81ms │  475.18ms │ +1.12x faster │
│ QQuery 25    │     470.34ms │  373.89ms │ +1.26x faster │
│ QQuery 26    │     576.26ms │  476.26ms │ +1.21x faster │
│ QQuery 27    │    1121.43ms │ 1056.52ms │ +1.06x faster │
│ QQuery 28    │    4291.35ms │ 4288.26ms │     no change │
│ QQuery 29    │     228.93ms │  235.60ms │     no change │
│ QQuery 30    │     589.98ms │  541.70ms │ +1.09x faster │
│ QQuery 31    │     716.59ms │  702.51ms │     no change │
│ QQuery 32    │    2593.06ms │ 2528.18ms │     no change │
│ QQuery 33    │    2362.52ms │ 2336.75ms │     no change │
│ QQuery 34    │    2360.44ms │ 2334.17ms │     no change │
│ QQuery 35    │     705.78ms │  696.32ms │     no change │
│ QQuery 36    │     163.69ms │  156.20ms │     no change │
│ QQuery 37    │     129.14ms │   97.25ms │ +1.33x faster │
│ QQuery 38    │      88.72ms │   90.76ms │     no change │
│ QQuery 39    │     218.61ms │  213.61ms │     no change │
│ QQuery 40    │      71.73ms │   73.40ms │     no change │
│ QQuery 41    │      68.23ms │   67.56ms │     no change │
│ QQuery 42    │      69.22ms │   68.76ms │     no change │
└──────────────┴──────────────┴───────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary           ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (no-simd-utf8)   │ 34947.62ms │
│ Total Time (simd-utf8)      │ 33803.08ms │
│ Average Time (no-simd-utf8) │   812.74ms │
│ Average Time (simd-utf8)    │   786.12ms │
│ Queries Faster              │         12 │
│ Queries Slower              │          1 │
│ Queries with No Change      │         30 │
└─────────────────────────────┴────────────┘

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Dandandan and @findepi, @XiangpengHao @doki23 and @tustvold

@etseidl I wonder if you have any thoughts on this PR?

I double checked that this PR catches the important validation path in parquet. There are some other places where utf8 is validated, but they seem like they are relatively

https://github.com/search?q=repo%3Aapache%2Farrow-rs+from_utf8+path%3A%2F%5Eparquet%5C%2Fsrc%5C%2F%2F&type=code

It also appears this library is used by polars which gives me some confidence it is stable and will have community support if there are issues: https://crates.io/crates/simdutf8/reverse_dependencies

Screenshot 2025-01-08 at 6 09 24 PM

Thus I think we should proceed and add a flag to disable the feature as a follow on PR in case anyone would like to disable this

@etseidl
Copy link
Contributor

etseidl commented Jan 8, 2025

@etseidl I wonder if you have any thoughts on this PR?

None that haven't already been voiced. It seems like a fairly low risk (especially if made optional) way to get a significant speed up in string handling.

+1

@alamb
Copy link
Contributor

alamb commented Jan 14, 2025

I merged this branch up from main and intend to merge it when the CI passes.

I will also make a follow on PR to make this feature optional.

@alamb
Copy link
Contributor

alamb commented Jan 14, 2025

Let's keep chipping away at performance 🚀 -- thank you all for your thoughts and help here

Here is the follow on PR to add validation as a feature flag:

@alamb alamb merged commit 90d77bd into apache:main Jan 14, 2025
17 checks passed
@andygrove andygrove changed the title Faster parquet utf8 validation using simdjson Faster parquet utf8 validation using simdutf8 Jan 14, 2025
svencowart pushed a commit to elastiflow/arrow-rs that referenced this pull request Jan 14, 2025
* Faster utf8 validation

* Move dependency

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up Parquet utf8 validation
7 participants