-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BQ Pushdown support for Group By Plugin aggregators: Sum Of Squares,Corrected Sum of Squares #1580
base: develop
Are you sure you want to change the base?
Conversation
…Corrected Sum of Squares
@@ -124,6 +124,8 @@ public class GroupByAggregator extends RecordReducibleAggregator<AggregateResult | |||
put(GroupByConfig.Function.CONCATDISTINCT, "STRING_AGG(DISTINCT CAST(%s AS STRING) , \", \")"); | |||
put(GroupByConfig.Function.LOGICALAND, "COALESCE(LOGICAL_AND(%s), TRUE)"); | |||
put(GroupByConfig.Function.LOGICALOR, "COALESCE(LOGICAL_OR(%s), FALSE)"); | |||
put(GroupByConfig.Function.SUMOFSQUARES, "SUM(POW(%s, 2))"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the ANSI SQL POWER
function. POW is a BigQuery-only method name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you mean SUM(POWER(%s, 2)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
Also, the Spark function returns 0 if the input is empty, so we need to wrap this in a conditional block:
CASE WHEN COUNT(%s) > 0 THEN <this expression> ELSE 0 END
See https://github.com/cdapio/hydrator-plugins/pull/1575/files#diff-f285497c578d095fc4c85b1968df3e4dc52220cf71458410b5e1a347081f531fR121 for an example
@@ -124,6 +124,8 @@ public class GroupByAggregator extends RecordReducibleAggregator<AggregateResult | |||
put(GroupByConfig.Function.CONCATDISTINCT, "STRING_AGG(DISTINCT CAST(%s AS STRING) , \", \")"); | |||
put(GroupByConfig.Function.LOGICALAND, "COALESCE(LOGICAL_AND(%s), TRUE)"); | |||
put(GroupByConfig.Function.LOGICALOR, "COALESCE(LOGICAL_OR(%s), FALSE)"); | |||
put(GroupByConfig.Function.SUMOFSQUARES, "SUM(POW(%s, 2))"); | |||
put(GroupByConfig.Function.CORRECTEDSUMOFSQUARES, "SUM(POW(%s, 2)) - SUM(%s)/COUNT(%s)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This format requires 3 arguments instead of one. In order to repeat the same element multiple times, see line 121:
String.format("%s %<s %<s", "bla") -> bla bla bla
Similar to the previous method, we need to use POWER
(Ansi SQL) instead of POW
(BigQuery specific). This way we can move this to the Ansi block. Please rebase with the latest.
The second SUM
statement should be elevated to the power of 2.
Finally, this method will have a division by zero if COUNT(exp) = 0. Since the Spark implementation returns 0 if there are no entries, we can use a case statement to solve this:
CASE WHEN COUNT(%s) > 0 THEN <this expression> ELSE 0 END
No description provided.