Alternative HyperLogLog implementation with higher accuracy #4248
Comments
@jamesyfshao just FYI, I think mobile analytics is using FASTHLL right now.
@jamesyfshao If you are using FASTHLL, please check out the documentation change in #4251, as we are moving away from FASTHLL to DISTINCTCOUNTHLL.
Could you elaborate more on the performance, and have you done any performance testing? Below is how Uber uses fasthll: From both a storage and a computation perspective, fasthll is more efficient IMO.
@fx19880617 FASTHLL takes serialized HyperLogLog objects as String type, while DISTINCTCOUNTHLL can take them as BYTES (byte array) type, which is more efficient in both storage and deserialization. BYTES support for DISTINCTCOUNTHLL was added after Pinot gained support for the BYTES data type.
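To illustrate the storage point above, here is a minimal sketch. It does not use Pinot or stream-lib: `register_bytes` is a hypothetical stand-in for a serialized sketch (one zeroed byte per register), and hex is only an assumed text encoding, not necessarily what FASTHLL uses. It shows that a text-encoded payload is larger than the raw bytes and needs an extra decode step before use.

```python
def register_bytes(log2m: int) -> bytes:
    """Hypothetical serialized sketch: one zeroed byte per register (2**log2m registers)."""
    return bytes(2 ** log2m)


def as_hex_string(payload: bytes) -> str:
    """Assumed text encoding for illustration: two hex characters per byte."""
    return payload.hex()


payload = register_bytes(12)      # raw BYTES-style payload
text = as_hex_string(payload)     # String-style payload

print("bytes payload length:", len(payload))
print("hex string length:   ", len(text))

# Reading the string form back requires decoding to bytes first.
assert bytes.fromhex(text) == payload
```

The exact ratio depends on the encoding chosen; the point is only that a byte-array column avoids the encoding overhead and the decode step.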
Got it. @jamesyfshao ^^
Also cc @icefury71 ^^
https://github.com/prasanthj/hyperloglog is a Java version (already in Maven) that can be easily integrated.
@Jackie-Jiang @fx19880617 Any follow-up?
@Jackie-Jiang @fx19880617 @jamesyfshao Any follow-up?
After some research, will try out the DataSketches (https://datasketches.github.io) library and evaluate its performance.
@Jackie-Jiang https://datasketches.github.io/docs/HLL/Hll_vs_Hllpp.html There is also a detailed discussion about measuring sketch performance in general that might be helpful. Note that the above links are to our old website, which has not yet migrated to Apache. The links will soon change to datasketches.apache.org.
Currently Pinot is using https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLog.java. Some of our customers have observed that HLL++ has higher accuracy than HLL when the cardinality of a dimension is in the 10k-100k range. We also tried different log2m values for distinctCountHLL, but that puts more load on the CPU. Is there a plan to support HLL++ in Pinot's distinctCountHLL (or other functions)? https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java
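For context on the log2m trade-off mentioned above, here is a rough sketch using the standard asymptotic estimate for HyperLogLog, a relative standard error of about 1.04 / sqrt(m) with m = 2^log2m registers. The function name is made up for illustration, and the formula says nothing about the small-to-mid-range bias that HLL++ tries to correct, so treat it as a guide only.

```python
import math


def hll_relative_std_error(log2m: int) -> float:
    """Asymptotic relative standard error of plain HyperLogLog with 2**log2m registers."""
    m = 2 ** log2m
    return 1.04 / math.sqrt(m)


# Each +2 in log2m quadruples the register count but only halves the error.
for log2m in (8, 10, 12, 14):
    registers = 2 ** log2m
    error_pct = 100 * hll_relative_std_error(log2m)
    print(f"log2m={log2m:2d}  registers={registers:6d}  approx. error={error_pct:.2f}%")
```

This is why raising log2m gets expensive quickly: the accuracy gain is a square root of the extra registers that have to be stored and merged.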
Linked a draft PR: #11346
The current HyperLogLog implementation is fairly old: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLog.java
There are several alternative implementations that can potentially reduce memory cost and improve accuracy.
https://github.com/axiomhq/hyperloglog
https://github.com/clarkduvall/hyperloglog