diff --git a/README.Rmd b/README.Rmd
index 5ea54957..4db6da8d 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -24,12 +24,36 @@ set.seed(20230702)
 [![R-CMD-check](https://github.com/duckdblabs/duckplyr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/duckdblabs/duckplyr/actions/workflows/R-CMD-check.yaml)
 
-The goal of duckplyr is to provide a drop-in replacement for dplyr that uses DuckDB as a backend for fast operation.
-It also defines a set of generics that provide a low-level implementer's interface for dplyr's high-level user interface.
+The goal of duckplyr is to provide a drop-in replacement for dplyr that uses [DuckDB](https://duckdb.org/) as a backend for fast operation.
+DuckDB is an in-process SQL OLAP database management system.
 
-## Example
+duckplyr also defines a set of generics that provide a low-level implementer's interface for dplyr's high-level user interface.
 
-```{r load}
+## Installation
+
+Install duckplyr from CRAN with:
+
+``` r
+install.packages("duckplyr")
+```
+
+You can also install the development version of duckplyr from R-universe:
+
+``` r
+install.packages('duckplyr', repos = c('https://duckdblabs.r-universe.dev', 'https://cloud.r-project.org'))
+```
+
+Or from [GitHub](https://github.com/) with:
+
+``` r
+# install.packages("pak", repos = sprintf("https://r-lib.github.io/p/pak/stable/%s/%s/%s", .Platform$pkgType, R.Version()$os, R.Version()$arch))
+pak::pak("duckdblabs/duckplyr")
+```
+
+
+## Examples
+
+```{r attach}
 library(conflicted)
 library(duckplyr)
 conflict_prefer("filter", "duckplyr")
@@ -37,18 +61,19 @@ conflict_prefer("filter", "duckplyr")
 
 There are two ways to use duckplyr.
 
-1. To enable for individual data frames, use `as_duckplyr_df()` as the first step in your pipe.
-1. To enable for the entire session, use `methods_overwrite()`.
+1. To enable duckplyr for individual data frames, use `as_duckplyr_df()` as the first step in your pipe.
+1. To enable duckplyr for the entire session, use `methods_overwrite()`.
 
 The examples below illustrate both methods.
 See also the companion [demo repository](https://github.com/Tmonster/duckplyr_demo) for a use case with a large dataset.
 
-### Individual
+### Usage for individual data frames
 
 This example illustrates usage of duckplyr for individual data frames.
 
-```{r individual}
-# Use `as_duckplyr_df()` to enable processing with duckdb:
+Use `as_duckplyr_df()` to enable processing with duckdb:
+
+```{r}
 out <-
   palmerpenguins::penguins %>%
   # CAVEAT: factor columns are not supported yet
@@ -57,35 +82,49 @@ out <-
   mutate(bill_area = bill_length_mm * bill_depth_mm) %>%
   summarize(.by = c(species, sex), mean_bill_area = mean(bill_area)) %>%
   filter(species != "Gentoo")
+```
+
+The result is a data frame or tibble, with its own class.
 
-# The result is a data frame or tibble, with its own class.
+```{r}
 class(out)
 names(out)
+```
+
+duckdb is responsible for eventually carrying out the operations.
+Despite the late filter, the summary is not computed for the Gentoo species.
 
-# duckdb is responsible for eventually carrying out the operations.
-# Despite the late filter, the summary is not computed for the Gentoo species.
+```{r}
 out %>%
   explain()
+```
 
-# All data frame operations are supported.
-# Computation happens upon the first request.
+All data frame operations are supported.
+Computation happens upon the first request.
+
+```{r}
 out$mean_bill_area
+```
 
-# After the computation has been carried out, the results are available
-# immediately:
+After the computation has been carried out, the results are available immediately:
+
+```{r}
 out
 ```
 
-### Session-wide
+### Session-wide usage
 
 This example illustrates usage of duckplyr for all data frames in the R session.
 
-```{r session}
-# Use `methods_overwrite()` to enable processing with duckdb for all data frames:
+Use `methods_overwrite()` to enable processing with duckdb for all data frames:
+
+```{r}
 methods_overwrite()
+```
 
+This is the same query as above, without `as_duckplyr_df()`:
 
-# This is the same query as above, without `as_duckplyr_df()`:
+```{r}
 out <-
   palmerpenguins::penguins %>%
   # CAVEAT: factor columns are not supported yet
@@ -93,17 +132,29 @@ out <-
   mutate(bill_area = bill_length_mm * bill_depth_mm) %>%
   summarize(.by = c(species, sex), mean_bill_area = mean(bill_area)) %>%
   filter(species != "Gentoo")
+```
+
+The result is a plain tibble now:
 
-# The result is a plain tibble now:
+```{r}
 class(out)
+```
+
+Querying the number of rows also starts the computation:
 
-# Querying the number of rows also starts the computation:
+```{r}
 nrow(out)
+```
 
-# Restart R, or call `methods_restore()` to revert to the default dplyr implementation.
+Restart R, or call `methods_restore()` to revert to the default dplyr implementation.
+
+```{r}
 methods_restore()
+```
+
+dplyr is active again:
 
-# dplyr is active again:
+```{r}
 palmerpenguins::penguins %>%
   # CAVEAT: factor columns are not supported yet
   mutate(across(where(is.factor), as.character)) %>%
@@ -211,17 +262,3 @@ rel_names.dfrel <- function(rel, ...) {
 rel_names(mtcars_rel)
 ```
 
-## Installation
-
-Install duckplyr from CRAN with:
-
-``` r
-install.packages("duckplyr")
-```
-
-You can also install the development version of duckplyr from [GitHub](https://github.com/) with:
-
-``` r
-# install.packages("pak", repos = sprintf("https://r-lib.github.io/p/pak/stable/%s/%s/%s", .Platform$pkgType, R.Version()$os, R.Version()$arch))
-pak::pak("duckdblabs/duckplyr")
-```
diff --git a/README.md b/README.md
index c2936ff2..b965034f 100644
--- a/README.md
+++ b/README.md
@@ -8,9 +8,29 @@
-The goal of duckplyr is to provide a drop-in replacement for dplyr that uses DuckDB as a backend for fast operation.
It also defines a set of generics that provide a low-level implementer’s interface for dplyr’s high-level user interface. +The goal of duckplyr is to provide a drop-in replacement for dplyr that uses [DuckDB](https://duckdb.org/) as a backend for fast operation. DuckDB is an in-process SQL OLAP database management system. -## Example +duckplyr also defines a set of generics that provide a low-level implementer’s interface for dplyr’s high-level user interface. + +## Installation + +Install duckplyr from CRAN with: + +
+install.packages("duckplyr")
+
+You can also install the development version of duckplyr from R-universe:
+
+
+install.packages('duckplyr', repos = c('https://duckdblabs.r-universe.dev', 'https://cloud.r-project.org'))
+
+Or from [GitHub](https://github.com/) with:
+
++# install.packages("pak", repos = sprintf("https://r-lib.github.io/p/pak/stable/%s/%s/%s", .Platform$pkgType, R.Version()$os, R.Version()$arch)) +pak::pak("duckdblabs/duckplyr")+ +## Examples
library(conflicted) @@ -21,17 +41,18 @@ The goal of duckplyr is to provide a drop-in replacement for dplyr that uses Duc There are two ways to use duckplyr. -1. To enable for individual data frames, use [`as_duckplyr_df()`](https://duckdblabs.github.io/duckplyr/reference/as_duckplyr_df.html) as the first step in your pipe. -2. To enable for the entire session, use [`methods_overwrite()`](https://duckdblabs.github.io/duckplyr/reference/methods_overwrite.html). +1. To enable duckplyr for individual data frames, use [`as_duckplyr_df()`](https://duckdblabs.github.io/duckplyr/reference/as_duckplyr_df.html) as the first step in your pipe. +2. To enable duckplyr for the entire session, use [`methods_overwrite()`](https://duckdblabs.github.io/duckplyr/reference/methods_overwrite.html). The examples below illustrate both methods. See also the companion [demo repository](https://github.com/Tmonster/duckplyr_demo) for a use case with a large dataset. -### Individual +### Usage for individual data frames This example illustrates usage of duckplyr for individual data frames. +Use [`as_duckplyr_df()`](https://duckdblabs.github.io/duckplyr/reference/as_duckplyr_df.html) to enable processing with duckdb: +-# Use `as_duckplyr_df()` to enable processing with duckdb: out <- palmerpenguins::penguins %>% # CAVEAT: factor columns are not supported yet @@ -39,16 +60,19 @@ This example illustrates usage of duckplyr for individual data frames. as_duckplyr_df() %>% mutate(bill_area = bill_length_mm * bill_depth_mm) %>% summarize(.by = c(species, sex), mean_bill_area = mean(bill_area)) %>% - filter(species != "Gentoo") - -# The result is a data frame or tibble, with its own class. + filter(species != "Gentoo")+ +The result is a data frame or tibble, with its own class. + +class(out) #> [1] "duckplyr_df" "tbl_df" "tbl" "data.frame" names(out) -#> [1] "species" "sex" "mean_bill_area" - -# duckdb is responsible for eventually carrying out the operations. 
-# Despite the late filter, the summary is not computed for the Gentoo species. +#> [1] "species" "sex" "mean_bill_area"+ +duckdb is responsible for eventually carrying out the operations. Despite the late filter, the summary is not computed for the Gentoo species. + +out %>% explain() #> ┌───────────────────────────┐ @@ -77,7 +101,7 @@ This example illustrates usage of duckplyr for individual data frames. #> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ #> │ (species != 'Gentoo') │ #> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ -#> │ EC: 0 │ +#> │ EC: 344 │ #> └─────────────┬─────────────┘ #> ┌─────────────┴─────────────┐ #> │ R_DATAFRAME_SCAN │ @@ -89,11 +113,12 @@ This example illustrates usage of duckplyr for individual data frames. #> │ bill_depth_mm │ #> │ sex │ #> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ -#> │ EC: 0 │ -#> └───────────────────────────┘ - -# All data frame operations are supported. -# Computation happens upon the first request. +#> │ EC: 344 │ +#> └───────────────────────────┘+ +All data frame operations are supported. Computation happens upon the first request. + +out$mean_bill_area #> materializing: #> --------------------- @@ -102,7 +127,7 @@ This example illustrates usage of duckplyr for individual data frames. #> Filter [!=(species, 'Gentoo')] #> Aggregate [species, sex, mean(bill_area)] #> Projection [species as species, island as island, bill_length_mm as bill_length_mm, bill_depth_mm as bill_depth_mm, flipper_length_mm as flipper_length_mm, body_mass_g as body_mass_g, sex as sex, "year" as year, *(bill_length_mm, bill_depth_mm) as bill_area] -#> r_dataframe_scan(0x107a45b78) +#> r_dataframe_scan(0x10c4c7628) #> #> --------------------- #> -- Result Columns -- @@ -111,29 +136,33 @@ This example illustrates usage of duckplyr for individual data frames. 
#> - sex (VARCHAR) #> - mean_bill_area (DOUBLE) #> -#> [1] 770.2627 656.8523 694.9360 819.7503 984.2279 - -# After the computation has been carried out, the results are available -# immediately: +#> [1] 770.2627 656.8523 819.7503 694.9360 984.2279+ +After the computation has been carried out, the results are available immediately: + +out #> # A tibble: 5 × 3 #> species sex mean_bill_area #> <chr> <chr> <dbl> #> 1 Adelie male 770. #> 2 Adelie female 657. -#> 3 Adelie NA 695. -#> 4 Chinstrap female 820. +#> 3 Chinstrap female 820. +#> 4 Adelie NA 695. #> 5 Chinstrap male 984.-### Session-wide +### Session-wide usage This example illustrates usage of duckplyr for all data frames in the R session. +Use [`methods_overwrite()`](https://duckdblabs.github.io/duckplyr/reference/methods_overwrite.html) to enable processing with duckdb for all data frames: + ++methods_overwrite()
+ +This is the same query as above, without [`as_duckplyr_df()`](https://duckdblabs.github.io/duckplyr/reference/as_duckplyr_df.html): +-# Use `methods_overwrite()` to enable processing with duckdb for all data frames: -methods_overwrite() - -# This is the same query as above, without `as_duckplyr_df()`: out <- palmerpenguins::penguins %>% # CAVEAT: factor columns are not supported yet @@ -141,12 +170,19 @@ This example illustrates usage of duckplyr for all data frames in the R session. mutate(bill_area = bill_length_mm * bill_depth_mm) %>% summarize(.by = c(species, sex), mean_bill_area = mean(bill_area)) %>% filter(species != "Gentoo") - -# The result is a plain tibble now: +#> Error processing with relational. +#> Caused by error in `duckdb_rel_from_df()`: +#> ! Can't convert factor columns to relational. Affected column: `species`.+ +The result is a plain tibble now: + +class(out) -#> [1] "tbl_df" "tbl" "data.frame" - -# Querying the number of rows also starts the computation: +#> [1] "tbl_df" "tbl" "data.frame"+ +Querying the number of rows also starts the computation: + +nrow(out) #> materializing: #> --------------------- @@ -155,7 +191,7 @@ This example illustrates usage of duckplyr for all data frames in the R session. #> Filter [!=(species, 'Gentoo')] #> Aggregate [species, sex, mean(bill_area)] #> Projection [species as species, island as island, bill_length_mm as bill_length_mm, bill_depth_mm as bill_depth_mm, flipper_length_mm as flipper_length_mm, body_mass_g as body_mass_g, sex as sex, "year" as year, *(bill_length_mm, bill_depth_mm) as bill_area] -#> r_dataframe_scan(0x1254b4b58) +#> r_dataframe_scan(0x10a81d568) #> #> --------------------- #> -- Result Columns -- @@ -163,12 +199,16 @@ This example illustrates usage of duckplyr for all data frames in the R session. #> - species (VARCHAR) #> - sex (VARCHAR) #> - mean_bill_area (DOUBLE) -#> [1] 5 - -# Restart R, or call `methods_restore()` to revert to the default dplyr implementation. 
-methods_restore() - -# dplyr is active again: +#> [1] 5+ +Restart R, or call [`methods_restore()`](https://duckdblabs.github.io/duckplyr/reference/methods_overwrite.html) to revert to the default dplyr implementation. + ++methods_restore()
+ +dplyr is active again: + +palmerpenguins::penguins %>% # CAVEAT: factor columns are not supported yet mutate(across(where(is.factor), as.character)) %>% @@ -339,16 +379,3 @@ This package also provides generics, for which other packages may then implement rel_names(mtcars_rel) #> [1] "mpg" "cyl" "disp" "hp"- -## Installation - -Install duckplyr from CRAN with: - --install.packages("duckplyr")
- -You can also install the development version of duckplyr from [GitHub](https://github.com/) with: - --# install.packages("pak", repos = sprintf("https://r-lib.github.io/p/pak/stable/%s/%s/%s", .Platform$pkgType, R.Version()$os, R.Version()$arch)) -pak::pak("duckdblabs/duckplyr")