Dancing 💃 with the stats, aka tibble() dancing 🕺. dance is a sort of
reinvention of the classic dplyr verbs, with a more modern stack
underneath, i.e. it leverages a lot from vctrs and rlang.
You can install the development version from GitHub.
# install.packages("pak")
pak::pkg_install("romainfrancois/dance")
We’ll illustrate tibble dancing with iris grouped by Species.
library(dance)
g <- iris %>% group_by(Species)
The next four verbs, waltz(), polka(), tango() and charleston(), are in
the neighborhood of dplyr::summarise().
waltz() takes a grouped tibble and a list of formulas and returns a
tibble with as many columns as supplied formulas and one row per group.
It does not prepend the grouping variables (see tango() for that).
g %>%
  waltz(
    Sepal.Length = ~mean(Sepal.Length),
    Sepal.Width = ~mean(Sepal.Width)
  )
#> # A tibble: 3 x 2
#> Sepal.Length Sepal.Width
#> <dbl> <dbl>
#> 1 5.01 3.43
#> 2 5.94 2.77
#> 3 6.59 2.97
polka() deals with peeling off one layer of grouping:
g %>%
  polka()
#> # A tibble: 3 x 1
#> Species
#> <fct>
#> 1 setosa
#> 2 versicolor
#> 3 virginica
tango() binds the results of polka() and waltz(), so it is the closest
to dplyr::summarise():
g %>%
  tango(
    Sepal.Length = ~mean(Sepal.Length),
    Sepal.Width = ~mean(Sepal.Width)
  )
#> # A tibble: 3 x 3
#> Species Sepal.Length Sepal.Width
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 3.43
#> 2 versicolor 5.94 2.77
#> 3 virginica 6.59 2.97
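For readers who know dplyr, the relationship between the three verbs can be sketched without dance at all. This is only an analogy under the assumption that dplyr (1.0 or later) is installed, not a description of how dance is implemented: group_keys() stands in for polka(), a summary with the grouping column dropped stands in for waltz(), and bind_cols() plays the role of tango().

```r
library(dplyr)

g <- group_by(iris, Species)

# Group keys only, one row per group (roughly what polka() returns here)
keys <- group_keys(g)

# Summaries without the grouping column (roughly what waltz() returns)
sums <- g %>%
  summarise(
    Sepal.Length = mean(Sepal.Length),
    Sepal.Width = mean(Sepal.Width)
  ) %>%
  select(-Species)

# Binding the two side by side gives the usual summarise() layout
bound <- bind_cols(keys, sums)
bound
```

The bound tibble has the same three columns as the tango() output above.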
charleston() is like tango(), but it packs the new columns into a
single tibble column:
g %>%
  charleston(
    Sepal.Length = ~mean(Sepal.Length),
    Sepal.Width = ~mean(Sepal.Width)
  )
#> # A tibble: 3 x 2
#> Species data$Sepal.Length $Sepal.Width
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 3.43
#> 2 versicolor 5.94 2.77
#> 3 virginica 6.59 2.97
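The "3 x 2" header above can look surprising, because three names are printed. A packed column is a data frame stored as one column of the outer tibble. The sketch below reproduces the same shape with plain dplyr (assuming dplyr 1.0 or later, where a named tibble() result in summarise() becomes a data frame column); it is an illustration of the shape, not of charleston() itself.

```r
library(dplyr)

g <- group_by(iris, Species)

# Naming a tibble() result stores it as one data frame column called `data`
packed <- g %>%
  summarise(
    data = tibble::tibble(
      Sepal.Length = mean(Sepal.Length),
      Sepal.Width = mean(Sepal.Width)
    )
  )

ncol(packed)        # counts Species and the packed `data` column
names(packed$data)  # the individual summaries live inside `data`
```

Individual values are reached with two steps of `$`, for example packed$data$Sepal.Length.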
There is no waltz_at(), tango_at(), etc. Instead, we can either apply
the same function to a set of columns or a set of functions to the same
column. For this, we need to learn new dance moves.
swing() and twist() are for applying the same function to a set of
columns:
library(tidyselect)
g %>%
  tango(swing(mean, starts_with("Petal")))
#> # A tibble: 3 x 3
#> Species Petal.Length Petal.Width
#> <fct> <dbl> <dbl>
#> 1 setosa 1.46 0.246
#> 2 versicolor 4.26 1.33
#> 3 virginica 5.55 2.03
g %>%
  tango(data = twist(mean, starts_with("Petal")))
#> # A tibble: 3 x 2
#> Species data$Petal.Length $Petal.Width
#> <fct> <dbl> <dbl>
#> 1 setosa 1.46 0.246
#> 2 versicolor 4.26 1.33
#> 3 virginica 5.55 2.03
They differ in the type of column that is created and in how the
columns are named.
swing() makes as many new columns as are selected by the tidy
selection, and the columns are named using a .name glue pattern, so we
can swing() several times.
g %>%
  tango(
    swing(mean, starts_with("Petal"), .name = "mean_{var}"),
    swing(median, starts_with("Petal"), .name = "median_{var}"),
  )
#> # A tibble: 3 x 5
#> Species mean_Petal.Leng… mean_Petal.Width median_Petal.Le…
#> <fct> <dbl> <dbl> <dbl>
#> 1 setosa 1.46 0.246 1.5
#> 2 versic… 4.26 1.33 4.35
#> 3 virgin… 5.55 2.03 5.55
#> # … with 1 more variable: median_Petal.Width <dbl>
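As a point of comparison, the same naming idea exists in dplyr as the .names glue pattern of across(). The sketch below assumes dplyr 1.0 or later and uses dplyr's {.col} placeholder where swing() uses {var}; it is a rough dplyr counterpart, not a claim about how swing() works internally.

```r
library(dplyr)

g <- group_by(iris, Species)

# One across() call per function, each with its own naming pattern
petal_stats <- g %>%
  summarise(
    across(starts_with("Petal"), mean, .names = "mean_{.col}"),
    across(starts_with("Petal"), median, .names = "median_{.col}")
  )

names(petal_stats)
```

Because each pattern produces distinct names, the two calls do not overwrite each other's columns.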
twist() instead creates a single data frame column.
g %>%
  tango(
    mean = twist(mean, starts_with("Petal")),
    median = twist(median, starts_with("Petal")),
  )
#> # A tibble: 3 x 3
#> Species mean$Petal.Length $Petal.Width median$Petal.Leng… $Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 1.46 0.246 1.5 0.2
#> 2 versicolor 4.26 1.33 4.35 1.3
#> 3 virginica 5.55 2.03 5.55 2
The first argument of swing() and twist() is either a function or a
formula that uses . as a placeholder. Subsequent arguments are
tidyselect selections.
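To see what a formula with a . placeholder means in practice, it can help to turn one into an ordinary function by hand. The sketch below uses rlang::as_function(), which accepts . for the first argument; whether dance relies on this exact helper is an assumption made for illustration only.

```r
library(rlang)

# A formula that uses `.` as the placeholder for the column values
center <- as_function(~ . - mean(.))
center(c(1, 2, 3))

# A plain function is returned unchanged
identical(as_function(mean), mean)
```

So passing mean and passing ~ mean(.) describe the same computation, and the formula form is convenient when the expression needs more than a bare function name.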
You can combine swing() and twist() in the same tango() or waltz():
g %>%
  tango(
    swing(mean, starts_with("Petal"), .name = "mean_{var}"),
    median = twist(median, contains("."))
  )
#> # A tibble: 3 x 4
#> Species mean_Petal.Leng… mean_Petal.Width median$Sepal.Le… $Sepal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 1.46 0.246 5 3.4
#> 2 versic… 4.26 1.33 5.9 2.8
#> 3 virgin… 5.55 2.03 6.5 3
#> # … with 2 more variables: $Petal.Length <dbl>, $Petal.Width <dbl>
Similarly, rumba() can be used to apply several functions to a single
column. rumba() creates single columns and zumba() packs them into a
data frame column.
g %>%
  tango(
    rumba(Sepal.Width, mean = mean, median = median, .name = "Sepal_{fun}"),
    Petal = zumba(Petal.Width, mean = mean, median = median)
  )
#> # A tibble: 3 x 4
#> Species Sepal_mean Sepal_median Petal$mean $median
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 3.4 0.246 0.2
#> 2 versicolor 2.77 2.8 1.33 1.3
#> 3 virginica 2.97 3 2.03 2
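The "several functions, one column" pattern of rumba() has a close dplyr analogue: across() with a named list of functions. The sketch assumes dplyr 1.0 or later and uses dplyr's {.fn} placeholder where rumba() uses {fun}; it covers only the unpacked case, not zumba().

```r
library(dplyr)

g <- group_by(iris, Species)

# A named list of functions applied to one column;
# the list names feed the {.fn} placeholder
sepal_stats <- g %>%
  summarise(
    across(
      Sepal.Width,
      list(mean = mean, median = median),
      .names = "Sepal_{.fn}"
    )
  )

names(sepal_stats)
```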
Now we enter the realms of dplyr::mutate() with:
- salsa(): to create new columns
- chacha(): to reorganize a grouped tibble so that data for each group is contiguous
- samba(): chacha() + salsa()
g %>%
  salsa(
    Sepal = ~Sepal.Length * Sepal.Width,
    Petal = ~Petal.Length * Petal.Width
  )
#> # A tibble: 150 x 2
#> Sepal Petal
#> <dbl> <dbl>
#> 1 17.8 0.280
#> 2 14.7 0.280
#> 3 15.0 0.26
#> 4 14.3 0.3
#> 5 18 0.280
#> 6 21.1 0.68
#> 7 15.6 0.42
#> 8 17 0.3
#> 9 12.8 0.280
#> 10 15.2 0.15
#> # … with 140 more rows
You can use swing(), twist(), rumba() and zumba() here too, and if you
want to keep the original data, you can use samba() instead of salsa():
g %>%
  samba(centered = twist(~ . - mean(.), everything(), -Species))
#> # A tibble: 150 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows, and 4 more variables: centered$Sepal.Length <dbl>,
#> # $Sepal.Width <dbl>, $Petal.Length <dbl>, $Petal.Width <dbl>
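The formula ~ . - mean(.) is evaluated within each group, so every value is centered on its own species mean rather than on the overall mean. A plain dplyr sketch of the same computation follows, assuming dplyr 1.0 or later. It creates flat centered_* columns instead of one packed column, so the layout differs from the samba() output above.

```r
library(dplyr)

# In a grouped mutate(), across(everything()) skips the grouping column,
# and mean(.x) is the mean within the current group
centered <- iris %>%
  group_by(Species) %>%
  mutate(across(everything(), ~ .x - mean(.x), .names = "centered_{.col}")) %>%
  ungroup()

# Each centered column should now average to about zero within each species
centered %>%
  group_by(Species) %>%
  summarise(check = mean(centered_Sepal.Length))
```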
madison() packs the columns that salsa() would have created:
g %>%
  madison(swing(~ . - mean(.), starts_with("Sepal")))
#> # A tibble: 150 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows, and 2 more variables: data$Sepal.Length <dbl>,
#> # $Sepal.Width <dbl>
bolero() is similar to dplyr::filter(). The formulas may be made by
mambo() if you want to apply the same predicate to a tidyselection of
columns:
g %>%
  bolero(~Sepal.Width > 4)
#> # A tibble: 3 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.7 4.4 1.5 0.4 setosa
#> 2 5.2 4.1 1.5 0.1 setosa
#> 3 5.5 4.2 1.4 0.2 setosa
g %>%
  bolero(mambo(~. > 4, starts_with("Sepal")))
#> # A tibble: 3 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.7 4.4 1.5 0.4 setosa
#> 2 5.2 4.1 1.5 0.1 setosa
#> 3 5.5 4.2 1.4 0.2 setosa
g %>%
  bolero(mambo(~. > 4, starts_with("Sepal"), .op = or))
#> # A tibble: 150 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
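The two mambo() calls above differ only in how the per-column results are combined: every selected column must pass by default, while .op = or asks for at least one. In dplyr the same distinction is spelled if_all() versus if_any(). The sketch assumes dplyr 1.0.4 or later and is offered as a comparison, not as dance's implementation.

```r
library(dplyr)

g <- group_by(iris, Species)

# Every Sepal column must be greater than 4
all_pass <- filter(g, if_all(starts_with("Sepal"), ~ .x > 4))

# At least one Sepal column must be greater than 4
any_pass <- filter(g, if_any(starts_with("Sepal"), ~ .x > 4))

nrow(all_pass)
nrow(any_pass)
```

The row counts match the two bolero() outputs above: Sepal.Length exceeds 4 for every flower in iris, so the "or" version keeps all rows, while the "and" version is limited by Sepal.Width.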