-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hyperLogLogPlus aggregation function for distinct count #11346
Conversation
32a26a0
to
5918003
Compare
Codecov Report
@@ Coverage Diff @@
## master #11346 +/- ##
============================================
+ Coverage 63.07% 63.10% +0.02%
- Complexity 1110 1121 +11
============================================
Files 2326 2342 +16
Lines 124918 125686 +768
Branches 19145 19306 +161
============================================
+ Hits 78792 79311 +519
- Misses 40510 40714 +204
- Partials 5616 5661 +45
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 42 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
c8948ae
to
7cbea4a
Compare
@Jackie-Jiang @chenboat @cbalci can you please review this PR? |
Test Query
Approximation Benchmarking result Result
|
public DistinctCountHLLPlusAggregationFunction(List<ExpressionContext> arguments) { | ||
super(arguments.get(0)); | ||
int numExpressions = arguments.size(); | ||
// This function expects 1 or 2 arguments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update javadoc, should we expect 2 or 3 instead based on the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank @chenboat for review. this function epxects 1 or 2 or 3 arguments.
I'd like to spend some time reviewing this today / tomorrow. |
@siddharthteotia thanks for reviewing. I will fix this conflict in this hour. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! Would you mind adding more info to this PR, like what functionalities are added, what's the difference between the raw one and the non-raw one, what the purpose of your benchmarking test is?
} | ||
return hllplus; | ||
} catch (Exception e) { | ||
throw new RuntimeException("Caught exception while merging HyperLogLogs", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Caught exception while merging HyperLogLogsPlus
public int getMaxAggregatedValueByteSize() { | ||
// NOTE: For aggregated metrics, initial aggregated value might have not been generated. Returns the byte size | ||
// based on log2m. | ||
return _maxByteSize > 0 ? _maxByteSize : 100000000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line doesn't seem to align with the comment above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @jackjlli for pointing out. let me uncomment the line below and remove this line.
2a995c4
to
9308869
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with minor comments
} | ||
return hllplus; | ||
} catch (Exception e) { | ||
throw new RuntimeException("Caught exception while merging HyperLogLogs", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) Revise the message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @Jackie-Jiang i fixed the error message.
private static HyperLogLogPlus getDistinctCountHLLPlusResult(Dictionary dictionary, | ||
DistinctCountHLLPlusAggregationFunction function) { | ||
if (dictionary.getValueType() == FieldSpec.DataType.BYTES) { | ||
// Treat BYTES value as serialized HyperLogLog |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) Revise the comment
|
||
@Override | ||
public HyperLogLogPlus deserialize(ByteBuffer byteBuffer) { | ||
// NOTE: The passed in byte buffer is always BIG ENDIAN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) Remove this comment as it doesn't apply
try { | ||
return HyperLogLogPlus.Builder.build(bytes); | ||
} catch (IOException e) { | ||
throw new RuntimeException("Caught exception while de-serializing HyperLogLogPlus", e); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try { | |
return HyperLogLogPlus.Builder.build(bytes); | |
} catch (IOException e) { | |
throw new RuntimeException("Caught exception while de-serializing HyperLogLogPlus", e); | |
} | |
return deserialize(bytes); |
thanks @Jackie-Jiang @jackjlli @siddharthteotia and @chenboat for review. I will address comment today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor but LGTM. Thanks for adding the aggregation function!
try { | ||
HyperLogLogPlus hyperLogLogPlus = aggregationResultHolder.getResult(); | ||
if (hyperLogLogPlus != null) { | ||
for (int i = 0; i < length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: these two for loop can be merged. We just need to check if (hyperLogLogPlus == null)
. If yes, assign the first one to this reference. The following logic should be the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for code review! addressed the code comment, please take a look
hyperLogLogPlus.addAll(ObjectSerDeUtils.HYPER_LOG_LOG_PLUS_SER_DE.deserialize(bytesValues[i])); | ||
} | ||
} | ||
for (int i = 0; i < length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: int i = 1;
93ecb3b
to
75da176
Compare
if (hyperLogLogPlus == null) { | ||
hyperLogLogPlus = ObjectSerDeUtils.HYPER_LOG_LOG_PLUS_SER_DE.deserialize(bytesValues[0]); | ||
aggregationResultHolder.setValue(hyperLogLogPlus); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need the else block if the check is not qualified:
} else {
hyperLogLogPlus.addAll(ObjectSerDeUtils.HYPER_LOG_LOG_PLUS_SER_DE.deserialize(bytesValues[0]));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got your point, updated, thanks!
75da176
to
f0cf7f2
Compare
feature
: add DistinctCountHLLPlus function.Context:
currently pinot is using Clearspring implementation of HLL algorithm. https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLog.java
reference: HLL paper: https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
we have some customers observed ElasticSearch dataset to Pinot, Pinot is using HLL algorithm while ES is using HLL++, user reported that HLL++ has higher accuracy than HLL when cardinality of dimension is at 10k-100k.
reference: HLL++ paper https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
We tried to solve the above issue by tuning the log2m parameters of HLL (distinctCountHLL) functions, then we observed the CPU usage increased and bring the cluster into unstable state.
This PR tried to bridge the gap between ES and Pinot by introducing HyperLogLogPlusPlus to Pinot. Clearspring already has support for the HLL++. https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java
Approximation Benchmark:
Test Query
Approximation Benchmarking result
Result