Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative HyperLogLog implementation with higher accuracy #4248

Closed
Jackie-Jiang opened this issue May 28, 2019 · 13 comments
Closed

Alternative HyperLogLog implementation with higher accuracy #4248

Jackie-Jiang opened this issue May 28, 2019 · 13 comments

Comments

@Jackie-Jiang
Copy link
Contributor

Current HyperLogLog implementation is fairly old: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLog.java

There are several alternative implementations that can potentially reduce memory cost and improve accuracy.
https://github.com/axiomhq/hyperloglog
https://github.com/clarkduvall/hyperloglog

@xiangfu0
Copy link
Contributor

@jamesyfshao just fyi, I think mobile analytics is using fasthll right now

@Jackie-Jiang
Copy link
Contributor Author

@jamesyfshao If you are using FASTHLL, please check out the documentation change in #4251 as we are moving away from FASTHLL to DISCTINCTCOUNTHLL.

@xiangfu0
Copy link
Contributor

xiangfu0 commented May 29, 2019

Could you elaborate more on the performance and do you do the performance testing?

Below is how Uber uses fasthll:
For fasthll, there is a pre-aggregation phase before data ingestion, e.g. we have a job publish all the dimension combinations and it's corresponding unique counting FASTHLL object every 5mins and offline job is daily basis. This leads to data volume reduction to only 10% of the original size.

From both storage and computation perspective, fasthll is more efficient IMO.

@Jackie-Jiang
Copy link
Contributor Author

@fx19880617 FASTHLL takes String type of serialized HyperLogLog objects, while DISTINCTCOUNTHLL can take BYTES (byte array) type of serialized HyperLogLog objects, which is more efficient in both storage and deserialization. The BYTES support for DISTINCTCOUNTHLL is new added after Pinot supporting BYTES data type.

@xiangfu0
Copy link
Contributor

Got it. @jamesyfshao ^^

@xiangfu0
Copy link
Contributor

also cc @icefury71 ^^

@extesy
Copy link

extesy commented Jun 4, 2019

https://github.com/prasanthj/hyperloglog is a java version (already in maven) which can be easily integrated.

@extesy
Copy link

extesy commented Jul 11, 2019

@Jackie-Jiang @fx19880617 Any follow-up?

@extesy
Copy link

extesy commented Aug 1, 2019

@Jackie-Jiang @fx19880617 @jamesyfshao any follow-up?

@Jackie-Jiang
Copy link
Contributor Author

After some research, will try out the DataSketches (https://datasketches.github.io) library and evaluate its performance.

@leerho
Copy link

leerho commented Oct 9, 2019

@Jackie-Jiang
We haven't evaluated all of the above HLL variants, but we have a detailed comparison of the DataSketches HllSketch vs the Clearspring HLL++ implementation on our website:

https://datasketches.github.io/docs/HLL/Hll_vs_Hllpp.html

There is also a detailed discussion about measuring sketch performance in general that might be helpful.

Note that the above links are to our old website, which has not yet migrated to Apache yet. The links will soon change to datasketches.apache.org.

@deemoliu
Copy link
Contributor

deemoliu commented Jun 29, 2023

Hi @Jackie-Jiang

currently pinot is using https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLog.java

we have some customers observed that HLL++ has higher accuracy than HLL when cardinality of dimension is at 10k-100k.

We also tried different log2m values for distinctCountHLL but it will bring more load on CPU.

is there a plan to support hll++ in Pinot distinctHLLL (or other functions)? https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java

@deemoliu
Copy link
Contributor

linked a Draft PR: #11346

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants