Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter pushdown discrepancy between duckdb and duckplyr #172

Open
PMassicotte opened this issue May 21, 2024 · 7 comments
Open

Filter pushdown discrepancy between duckdb and duckplyr #172

PMassicotte opened this issue May 21, 2024 · 7 comments

Comments

@PMassicotte
Copy link

This issue is related to the discussion in this comment:
#145 (comment)

The following code demonstrates the difference in filter
pushdown between DuckDB and Duckplyr as shown by the explain()
function.

With Duckplyr, the filter does not appear to be pushed down to the
Parquet files.

library(duckdb)
#> Loading required package: DBI
library(duckplyr)
#> ✔ Overwriting dplyr methods with duckplyr methods.
#> ℹ Turn off with `duckplyr::methods_restore()`.
#> 
#> Attaching package: 'duckplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:dbplyr':
#> 
#>     ident, sql
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# DuckDB ------------------------------------------------------------------

con <- dbConnect(duckdb())

dbSendQuery(con, "INSTALL httpfs; LOAD httpfs;")
#> <duckdb_result af590 connection=7ce20 statement='INSTALL httpfs; LOAD httpfs;'>
dbSendQuery(con, "SET s3_region='auto';SET s3_endpoint='';")
#> <duckdb_result e70c0 connection=7ce20 statement='SET s3_region='auto';SET s3_endpoint='';'>
f_duckdb <- function() {
  df <- tbl(
    con,
    "read_parquet('s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet')"
  )

  df |>
    filter(total_amount > 0L) |>
    filter(!is.na(passenger_count)) |>
    mutate(tip_pct = 100 * tip_amount / total_amount) |>
    summarise(
      avg_tip_pct = median(tip_pct),
      n = n(),
      .by = passenger_count
    ) |>
    arrange(desc(passenger_count))
}

# Duckplyr ----------------------------------------------------------------

con <- duckplyr:::get_default_duckdb_connection()

dbSendQuery(con, "INSTALL httpfs; LOAD httpfs;")
#> <duckdb_result bab60 connection=cd550 statement='INSTALL httpfs; LOAD httpfs;'>
dbSendQuery(con, "SET s3_region='auto';SET s3_endpoint='';")
#> <duckdb_result e8960 connection=cd550 statement='SET s3_region='auto';SET s3_endpoint='';'>
f_duckplyr <- function() {
  duckplyr::duckplyr_df_from_file(
    "s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet",
    "read_parquet",
    options = list(hive_partitioning = TRUE),
    class = class(tibble())
  ) |>
    filter(total_amount > 0L) |>
    filter(!is.na(passenger_count)) |>
    mutate(tip_pct = 100 * tip_amount / total_amount) |>
    summarise(
      avg_tip_pct = median(tip_pct),
      n = n(),
      .by = passenger_count
    ) |>
    arrange(desc(passenger_count))
}

explain(f_duckdb())
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> <SQL>
#> SELECT
#>   passenger_count,
#>   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY tip_pct) AS avg_tip_pct,
#>   COUNT(*) AS n
#> FROM (
#>   SELECT q01.*, (100.0 * tip_amount) / total_amount AS tip_pct
#>   FROM (FROM read_parquet('s3://duckplyr-demo-taxi-data/taxi-data-2019-partitioned/*/*.parquet')) q01
#>   WHERE (total_amount > 0) AND (NOT((passenger_count IS NULL)))
#> ) q01
#> GROUP BY passenger_count
#> ORDER BY passenger_count DESC
#> 
#> <PLAN>
#> physical_plan
#> ┌───────────────────────────┐
#> │          ORDER_BY         │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │          ORDERS:          │
#> │  q01.passenger_count DESC │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       HASH_GROUP_BY       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             #0            │
#> │     quantile_cont(#1)     │
#> │        count_star()       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │           FILTER          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │  (NOT (passenger_count IS │
#> │           NULL))          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 3524697        │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       READ_PARQUET        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │         tip_amount        │
#> │        total_amount       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │ Filters: total_amount>0.0 │
#> │ AND total_amount IS N...  │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 17623488       │
#> └───────────────────────────┘
explain(f_duckplyr())
#> ┌───────────────────────────┐
#> │          ORDER_BY         │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │          ORDERS:          │
#> │           #3 ASC          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │        avg_tip_pct        │
#> │             n             │
#> │     -(passenger_count)    │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │        avg_tip_pct        │
#> │             n             │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       HASH_GROUP_BY       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │             #0            │
#> │         median(#1)        │
#> │        count_star()       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │          tip_pct          │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │      passenger_count      │
#> │         tip_amount        │
#> │        total_amount       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │           FILTER          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │(r_base::>(total_amount, 0)│
#> │ AND (NOT ((passenger_count│
#> │   IS NULL) OR isnan(CAST  │
#> │(passenger_count AS DO...  │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 17623488       │
#> └─────────────┬─────────────┘                             
#> ┌─────────────┴─────────────┐
#> │       READ_PARQUET        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        total_amount       │
#> │      passenger_count      │
#> │         tip_amount        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        EC: 88117440       │
#> └───────────────────────────┘

As a result, duckplyr is slightly slower compared to duckdb.

bench::mark(f_duckdb(), f_duckplyr(), check = FALSE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f_duckdb()      1.29s    1.29s     0.777     336KB     1.55
#> 2 f_duckplyr()    6.95s    6.95s     0.144     103KB     0

Created on 2024-05-21 with reprex v2.1.0

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.0 (2024-04-24)
#>  os       Linux Mint 21.3
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_CA:en
#>  collate  en_CA.UTF-8
#>  ctype    en_CA.UTF-8
#>  tz       America/Montreal
#>  date     2024-05-21
#>  pandoc   3.1.11 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bench         1.1.3   2023-05-04 [1] RSPM (R 4.4.0)
#>  blob          1.2.4   2023-03-17 [1] RSPM
#>  cli           3.6.2   2023-12-11 [1] RSPM
#>  collections   0.3.7   2023-01-05 [1] RSPM
#>  DBI         * 1.2.2   2024-02-16 [1] RSPM
#>  dbplyr      * 2.5.0   2024-03-19 [1] RSPM
#>  digest        0.6.35  2024-03-11 [1] RSPM
#>  dplyr       * 1.1.4   2023-11-17 [1] RSPM
#>  duckdb      * 0.10.2  2024-05-01 [1] CRAN (R 4.4.0)
#>  duckplyr    * 0.4.0   2024-05-21 [1] CRAN (R 4.4.0)
#>  evaluate      0.23    2023-11-01 [1] RSPM
#>  fansi         1.0.6   2023-12-08 [1] RSPM
#>  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
#>  fs            1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
#>  generics      0.1.3   2022-07-05 [1] RSPM
#>  glue          1.7.0   2024-01-09 [1] RSPM
#>  htmltools     0.5.8.1 2024-04-04 [1] RSPM
#>  knitr         1.46    2024-04-06 [1] RSPM
#>  lifecycle     1.0.4   2023-11-07 [1] RSPM
#>  magrittr      2.0.3   2022-03-30 [1] RSPM
#>  pillar        1.9.0   2023-03-22 [1] RSPM
#>  pkgconfig     2.0.3   2019-09-22 [1] RSPM
#>  profmem       0.6.0   2020-12-13 [1] RSPM (R 4.4.0)
#>  purrr         1.0.2   2023-08-10 [1] RSPM
#>  R.cache       0.16.0  2022-07-21 [1] RSPM
#>  R.methodsS3   1.8.2   2022-06-13 [1] RSPM
#>  R.oo          1.26.0  2024-01-24 [1] RSPM
#>  R.utils       2.12.3  2023-11-18 [1] RSPM
#>  R6            2.5.1   2021-08-19 [1] RSPM
#>  reprex        2.1.0   2024-01-11 [1] RSPM
#>  rlang         1.1.3   2024-01-10 [1] RSPM
#>  rmarkdown     2.27    2024-05-17 [1] RSPM (R 4.4.0)
#>  rstudioapi    0.16.0  2024-03-24 [1] RSPM
#>  sessioninfo   1.2.2   2021-12-06 [1] RSPM
#>  styler        1.10.3  2024-04-07 [1] RSPM
#>  tibble        3.2.1   2023-03-20 [1] RSPM
#>  tidyselect    1.2.1   2024-03-11 [1] RSPM
#>  utf8          1.2.4   2023-10-22 [1] RSPM
#>  vctrs         0.6.5   2023-12-01 [1] RSPM
#>  withr         3.0.0   2024-01-16 [1] RSPM
#>  xfun          0.44    2024-05-15 [1] CRAN (R 4.4.0)
#>  yaml          2.3.8   2023-12-11 [1] RSPM
#> 
#>  [1] /home/filoche/R/x86_64-pc-linux-gnu-library/4.4
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@danwwilson
Copy link

I enountered a similar thing where I tried duckplyr on a parquet file and it was considerably slower than running the same query as as a straight query using a dbGetQuery() function. It was also when trying to apply a filter to the data before returning the data. I tried it with a count function instead and duckplyr was much faster.

@hadley
Copy link
Member

hadley commented Sep 25, 2024

Would be good to figure out what's going wrong here.

@toppyy
Copy link
Contributor

toppyy commented Sep 28, 2024

I had a look at the problem. To me it seems that the issue arises from the fact with Duckplyr the expression total_amount > 0 is translated to a function call which DuckDB cannot pushdown to the READ_PARQUET operation.

This can be observed from above query plan as the Duckplyr FILTER-operation filters on this expression: r_base::>(total_amount, 0).

If my debugging is correct, the pushdown operation here is handled ultimately by DuckDB's optimizer in FilterResult FilterCombiner::AddFilter(Expression &expr) . The function pushes down expression which are either foldable or of expression class ExpressionClass::BOUND_BETWEEN or ExpressionClass::BOUND_COMPARISON

From DuckDB's perspective the expression r_base::>(total_amount, 0) is a ExpressionClass::BOUND_FUNCTION which cannot be pushed down in a similar fashion.

I don't have a solution at hand, but I'll look into the possibility of interpreting total_amount > 0 as a comparison expression instead of a function call in similar situations.

@krlmlr
Copy link
Member

krlmlr commented Sep 28, 2024

Thanks for the investigation. I wonder if we somehow can label r_base::> as a BOUND_COMPARISON, which is what it is. We are using r_base::> because we want to fall back to R for all edge cases:

2 > TRUE
#> [1] TRUE

Created on 2024-09-28 with reprex v2.1.0

@toppyy
Copy link
Contributor

toppyy commented Sep 29, 2024

I'm not sure if we can label the expression rbase::>(disp,100) as a comparison operator and remedy the problem. BoundFunctionExpression's have a member function which I assume is called when it falls back to R. Casting to a ComparisonExpressions would mean no such member exists?

If I've understood correctly, we can only pass the following expression types via the duckdb relational api atm

  • FunctionExpression (rapi_expr_function)
  • ConstantExpression (rapi_expr_constant)
  • ColumnRefExpression (rapi_expr_reference)

To enable pushdown, we would need a comparison expression. And to ensure we can fallback on the edge cases @krlmlr mentioned, we would need type information to make a comparison operator only when the types are indeed comparable by duckdb. So the logic would be:

  • total_amount > 0 -> ExpressionType::COMPARE_GREATERTHAN
  • total_amount > TRUE -> ExpressionType::FUNCTION

I though about altering the expression type within rapi_rel_filter, but we don't have type information at that point (ParsedExpressions do not contain types).

The only idea I have is to introduce a ComparisonExpression to the relation api (ie. rapi_expr_comparison) to match duckdb's expression types. This would enable us to make comparison expressions when translating into duckdb expressions, but idk if there's a easier way or this conflicts some other design choices.

@toppyy
Copy link
Contributor

toppyy commented Oct 6, 2024

I added a sketch of the approach outlined above (using comparison expressions). The draft PR's translate total_amount > 0 to a comparison expression which get pushed to the scan operator. total_amount > TRUE still falls back to R with r_base::>.

What do you think about this approach @krlmlr ?

@krlmlr
Copy link
Member

krlmlr commented Oct 6, 2024

Thanks, nice! I need to take a closer look here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants