feat: update basic use vignette with current workflow (#59)
hgao1 authored Apr 28, 2024
1 parent 43fd44c commit fce8a82
Showing 4 changed files with 113 additions and 58 deletions.
4 changes: 4 additions & 0 deletions NEWS.md
@@ -1,3 +1,7 @@
# blueprintr 0.2.5.9000 (dev version)
* Updated the vignettes
* Added a new folder under `inst` to add metadata to the vignettes

# blueprintr 0.2.5
* Add capability to embed custom messages to check results, using `check.errors` attribute in returned logical value
* Refactor side-effect messages from built-in checks to `check.errors`
13 changes: 13 additions & 0 deletions inst/mapping/mtcars_item_mapping.csv
@@ -0,0 +1,13 @@
"name_1","description_1","coding_1","panel","homogenized_name","homogenized_coding","homogenized_description"
"rn","Name of car","NA","MTCARS_PANEL","name","NA","Name of Car"
"mpg","Miles per gallon","NA","MTCARS_PANEL","mpg","NA","Miles per gallon"
"cyl","Number of cylinders","NA","MTCARS_PANEL","cyl","NA","Number of cylinders"
"disp","Displacement","NA","MTCARS_PANEL","disp","NA","Displacement"
"hp","Gross horsepower","NA","MTCARS_PANEL","hp","NA","Gross horsepower"
"drat","Rear axle ratio","NA","MTCARS_PANEL","drat","NA","Rear axle ratio"
"wt","Weight","NA","MTCARS_PANEL","wt","NA","Weight"
"qsec","Quarter mile time","NA","MTCARS_PANEL","qsec","NA","Quarter mile time"
"vs","Engine","coding(code(""1"",""1""), code(""0"", ""0""))","MTCARS_PANEL","vs","coding(code(""1"",""straight""), code(""0"", ""v-shaped""))","Engine"
"am","Transmission","coding(code(""1"",""1""), code(""0"", ""0""))","MTCARS_PANEL","am","coding(code(""1"",""manual""), code(""0"", ""automatic""))","Transmission"
"gear","Number of forward gears","NA","MTCARS_PANEL","gear","NA","Number of forward gears"
"carb","Number of carburetors","NA","MTCARS_PANEL","carb","NA","Number of carburetors"
14 changes: 14 additions & 0 deletions inst/project/blueprints/example/homogenized.csv
@@ -0,0 +1,14 @@
"name","type","description","coding"
"name","character","Name of Car",
"mpg","double","Miles per gallon",
"cyl","double","Number of cylinders",
"disp","double","Displacement",
"hp","double","Gross horsepower",
"drat","double","Rear axle ratio",
"wt","double","Weight",
"qsec","double","Quarter mile time",
"vs","character","Engine","coding(code(""straight"",""1""), code(""v-shaped"",""0""))"
"am","character","Transmission","coding(code(""manual"",""1""), code(""automatic"",""0""))"
"gear","double","Number of forward gears",
"carb","double","Number of carburetors",
"wave","character",,
140 changes: 82 additions & 58 deletions vignettes/blueprintr.Rmd
@@ -1,5 +1,5 @@
---
title: "A Walkthrough of blueprintr"
title: "Introduction to blueprintr"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{A Walkthrough of blueprintr}
@@ -40,88 +40,112 @@ cache_location <- tempdir()
drake::clean(cache = drake::drake_cache(cache_location))
```

blueprintr is a companion to [drake](https://github.com/ropensci/drake) that focuses on documenting and testing tabular data. Whereas drake manages the workflow execution, blueprintr defines a collection of steps that need to be run in a drake workflow.
`blueprintr` is a framework for managing your data assets in a reproducible fashion. It builds on [drake](https://github.com/ropensci/drake) or [targets](https://cran.r-project.org/web/packages/targets/) and adds automated steps for documenting and testing tabular datasets. This gives researchers a replicable framework that helps keep programming errors from affecting analysis results.

# Basic Use
## Installation

The first, and recommended, step is to attach blueprintr to your R session with `library()`.
```{r setup, results= FALSE}
# install.packages("remotes")
# remotes::install_github("nyuglobalties/blueprintr")
```{r setup}
library(blueprintr)
```

In a [drake project](https://books.ropensci.org/drake/projects.html), all packages that you want attached are declared in a `"packages.R"` file. This `library(blueprintr)` command should go there.
## Designed Use of blueprintr
`blueprintr` gives your data the guardrails typically found in software engineering workflows.
This lets you test and document your data before deploying it to production.

blueprintr is built around "blueprints." Our first blueprint will be a blueprint for `mtcars`:
The top level of the `blueprintr` workflow is a "blueprints" directory, consisting of `.R` and `.csv` files.

```{r}
blueprint(
"mtcars_dat",
description = "The famous mtcars dataset",
command = {
mtcars
}
)
```
### About blueprints
Each blueprint has two components:
* Data Construction Spec, usually a `.R` file that instructs drake or targets on how to build a specific dataset.
* Metadata, usually a `.csv` file that incorporates any mapping files and checks that need to be done on the dataset.

All blueprints have
To create a blueprint, use the `blueprint()` function. Its three main arguments are `name` (the name of the generated dataset), `description` (a description of the dataset), and `command` (the code that builds the dataset).

* A name (the first argument) for the _target_ dataset.
* A description or brief summary of what the target is. Can be `NULL`.
* A command, which is a quoted statement that has the code for building this target.
* A metadata location, which is a path to where the target metadata is saved.
A project may need only a few blueprints, but more likely you'll need nested blueprints to transform the data.

<div class="vg-warning"><span>the blueprint name is "mtcars_dat" rather than "mtcars". If the two had the same name, drake would determine that the blueprint has a _circular dependency_ (it depends on itself). To avoid this, blueprints should not have the same names as global variables, like `mtcars`.</span></div>
`blueprintr` generates six "steps" (targets) per blueprint:

To get this loaded into a drake plan, we need to _attach_ it to an already existing plan using
Target name | Description
------------------------|--------------
`{blueprint}_initial` | The result of running the blueprint's `command`
`{blueprint}_blueprint` | A copy of the blueprint to be used throughout the plan
`{blueprint}_meta` | A copy of the dataset metadata --- if the metadata file doesn't exist, it will be created in this step
`{blueprint}_meta_path` | Creates the metadata file or loads it
`{blueprint}_checks` | Runs all tests on the `{blueprint}_initial` target
`{blueprint}` | The built dataset after running some cleanup tasks
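
As a quick illustration of the naming pattern in the table above, the target names for a given blueprint can be listed with base R. This is only a sketch of the convention; `blueprint_target_names()` is a hypothetical helper written for this example, not a `blueprintr` function.

```r
# Hypothetical helper (not part of blueprintr): list the target names
# that the table above describes for a blueprint called `name`.
blueprint_target_names <- function(name) {
  suffixes <- c("_initial", "_blueprint", "_meta", "_meta_path", "_checks", "")
  paste0(name, suffixes)
}

blueprint_target_names("mtcars_dat")
```

The last element has no suffix: it is the final, checked dataset, which is the target other parts of a plan would normally depend on.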

```r
attach_blueprint(plan, blueprint)
attach_blueprints(plan, ...)
```
<div class="vg-warning"><span>when writing other targets in your plan, it is advised to **not** refer to the `{blueprint}_initial` step since it could have problems which are discovered in the `{blueprint}_checks` step.</span></div>

<div class="vg-info"><span>`attach_blueprints` accepts "[tidy dots](https://adv-r.hadley.nz/quasiquotation.html#tidy-dots)", so if you have a `list()` of blueprints, you can "splat" all of those blueprints into `attach_blueprints` like `attach_blueprints(plan, !!!list_of_blueprints)`</span></div>

If you don't have an existing plan, you can create one with
## Example

```r
plan_from_blueprint(blueprint)
```
Let's take a well-known dataset, `mtcars`, and create a blueprint for it.

For now, we'll use an already existing plan, which is probably what you'll have most of the time.
```{r}
# Keeping the row names under the column `rn`
our_mtcars <- mtcars |> tibble::as_tibble(rownames = "rn")
```{r, include=FALSE}
existing_plan <- drake::drake_plan(initial_vector = runif(1000), squared = initial_vector ^ 2)
# Inspecting our mtcars dataset
head(our_mtcars)
```

We can load a user-created mapping file. The mapping file records how variable names and codings change between the original and the homogenized data.
```{r}
attach_blueprint(
existing_plan,
blueprint(
"mtcars_dat",
description = "The famous mtcars dataset",
command = {
mtcars
}
mapping_file <- system.file("mapping/mtcars_item_mapping.csv", package = "blueprintr", mustWork = TRUE)
# Read this csv file:
item_mapping <- mapping_file |>
readr::read_csv(
col_types = readr::cols(
name_1 = readr::col_character(),
description_1 = readr::col_character(),
coding_1 = readr::col_character(),
panel = readr::col_character(),
homogenized_name = readr::col_character(),
homogenized_coding = readr::col_character(),
homogenized_description = readr::col_character()
)
)
)
item_mapping
```
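
To make the role of the `name_1` and `homogenized_name` columns concrete, here is a minimal base R sketch of a name mapping being applied to a toy data frame. It is a conceptual illustration only: the objects `raw`, `toy_mapping`, and `renamed` are made up for this example, and this is not how `panelcleaner` is implemented.

```r
# Toy data and a toy mapping with the same column roles as the CSV above
# (`name_1` = original name, `homogenized_name` = new name).
raw <- data.frame(rn = c("Mazda RX4", "Datsun 710"), mpg = c(21.0, 22.8))
toy_mapping <- data.frame(
  name_1 = c("rn", "mpg"),
  homogenized_name = c("name", "mpg"),
  stringsAsFactors = FALSE
)

# Look up each existing column in the mapping and swap in its new name.
renamed <- raw
names(renamed) <- toy_mapping$homogenized_name[match(names(raw), toy_mapping$name_1)]

names(renamed)
```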

blueprintr creates five "steps" (targets) per blueprint:

Target name | Description
------------------------|--------------
`{blueprint}_initial` | The result of running the blueprint's `command`
`{blueprint}_blueprint` | A copy of the blueprint to be used throughout the plan
`{blueprint}_meta` | A copy of the dataset metadata --- if the metadata file doesn't exist, it will be created in this step
`{blueprint}_checks` | Runs all checks on the `{blueprint}_initial` target
`{blueprint}` | The built dataset after running some cleanup tasks
Then, we typically use a tool such as `panelcleaner` to attach our mapping file to the `mtcars` dataset.
This happens inside the blueprint's `command`, in the data construction spec.
```{r}
blueprint(
"mt_cars",
description = "mtcars database with attached metadata",
annotate = TRUE,
command = {
pnl <- panelcleaner::enpanel("MTCARS_PANEL", our_mtcars) |>
panelcleaner::add_mapping(item_mapping) |>
panelcleaner::homogenize_panel() |>
panelcleaner::bind_waves() |>
as.data.frame()
At this point, you're able to run `drake::make()` on this plan!
pnl_name <- get_attr(pnl, "panel_name")
pnl_mapping <- get_attr(pnl, "mapping")
<div class="vg-warning"><span>when writing other targets in your plan, it is advised to **not** refer to the `{blueprint}_initial` step since it could have problems which are discovered in the `{blueprint}_checks` step.</span></div>
pnl <-
pnl
```{r delete_cache, include=FALSE}
unlink(cache_location, recursive = TRUE)
class(pnl) <- c("mapped_df", class(pnl))
set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name)
}
) |>
bp_include_panelcleaner_meta()
```

When running this code with either `targets` or `drake`, the blueprint metadata is automatically created.
For our mtcars example, this looks like:
```{r, echo= FALSE}
mtcars_metadata <- system.file("project/blueprints/example/homogenized.csv", package = "blueprintr", mustWork = TRUE)
# Read this csv file:
mtcars_metadata |>
readr::read_csv()
```
Manually editing the metadata allows the user to add tests to check the data type and values.
And there you have it! You have created your first blueprint on the `mtcars` dataset.
When you run a pipeline with `blueprintr`, the checks warn researchers of issues at an early stage,
which helps them produce replicable results.
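
The NEWS entry for 0.2.5 notes that a check can return a logical value carrying a `check.errors` attribute. Below is a minimal base R sketch of a check written in that style. The function name `check_values_in_set()` and the message format are hypothetical, and how such a function is referenced from the metadata file is not shown here.

```r
# Hypothetical check (not a blueprintr built-in): TRUE when every value of
# `x` is in `allowed`; otherwise FALSE with a message attached as the
# `check.errors` attribute.
check_values_in_set <- function(x, allowed) {
  bad <- setdiff(unique(x), allowed)
  if (length(bad) == 0) {
    return(TRUE)
  }
  structure(
    FALSE,
    check.errors = paste0("Unexpected values: ", paste(bad, collapse = ", "))
  )
}

result <- check_values_in_set(c("manual", "automatic", "cvt"), c("manual", "automatic"))
attr(result, "check.errors")
```

Returning a plain logical keeps the check usable anywhere a `TRUE`/`FALSE` is expected, while the attribute carries the explanation for a failure.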
