feat: update basic use vignette with current workflow (#59)

nyuglobalties · Apr 28, 2024 · fce8a82
1 parent 43fd44c
commit fce8a82
Showing 4 changed files with 113 additions and 58 deletions.
diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,7 @@
+# blueprintr 0.2.5.9000 (dev version)
+* Updated the vignettes
+* Added a new folder under `inst` to add metadata to the vignettes
+
 # blueprintr 0.2.5
 * Add capability to embed custom messages to check results, using `check.errors` attribute in returned logical value
 * Refactor side-effect messages from built-in checks to `check.errors`

diff --git a/inst/mapping/mtcars_item_mapping.csv b/inst/mapping/mtcars_item_mapping.csv
@@ -0,0 +1,13 @@
+"name_1","description_1","coding_1","panel","homogenized_name","homogenized_coding","homogenized_description"
+"rn","Name of car","NA","MTCARS_PANEL","name","NA","Name of Car"
+"mpg","Miles per gallon","NA","MTCARS_PANEL","mpg","NA","Miles per gallon"
+"cyl","Number of cylinders","NA","MTCARS_PANEL","cyl","NA","Number of cylinders"
+"disp","Displacement","NA","MTCARS_PANEL","disp","NA","Displacement"
+"hp","Gross horsepower","NA","MTCARS_PANEL","hp","NA","Gross horsepower"
+"drat","Rear axle ratio","NA","MTCARS_PANEL","drat","NA","Rear axle ratio"
+"wt","Weight","NA","MTCARS_PANEL","wt","NA","Weight"
+"qsec","Quarter mile time","NA","MTCARS_PANEL","qsec","NA","Quarter mile time"
+"vs","Engine","coding(code(""1"",""1""), code(""0"", ""0""))","MTCARS_PANEL","vs","coding(code(""1"",""straight""), code(""0"", ""v-shaped""))","Engine"
+"am","Transmission","coding(code(""1"",""1""), code(""0"", ""0""))","MTCARS_PANEL","am","coding(code(""1"",""manual""), code(""0"", ""automatic""))","Transmission"
+"gear","Number of forward gears","NA","MTCARS_PANEL","gear","NA","Number of forward gears"
+"carb","Number of carburetors","NA","MTCARS_PANEL","carb","NA","Number of carburetors"
diff --git a/inst/project/blueprints/example/homogenized.csv b/inst/project/blueprints/example/homogenized.csv
@@ -0,0 +1,14 @@
+"name","type","description","coding"
+"name","character","Name of Car",
+"mpg","double","Miles per gallon",
+"cyl","double","Number of cylinders",
+"disp","double","Displacement",
+"hp","double","Gross horsepower",
+"drat","double","Rear axle ratio",
+"wt","double","Weight",
+"qsec","double","Quarter mile time",
+"vs","character","Engine","coding(code(""straight"",""1""), code(""v-shaped"",""0""))"
+"am","character","Transmission","coding(code(""manual"",""1""), code(""automatic"",""0""))"
+"gear","double","Number of forward gears",
+"carb","double","Number of carburetors",
+"wave","character",,
diff --git a/vignettes/blueprintr.Rmd b/vignettes/blueprintr.Rmd
@@ -1,5 +1,5 @@
 ---
-title: "A Walkthrough of blueprintr"
+title: "Introduction to blueprintr"
 output: rmarkdown::html_vignette
 vignette: >
   %\VignetteIndexEntry{A Walkthrough of blueprintr}
@@ -40,88 +40,112 @@ cache_location <- tempdir()
 drake::clean(cache = drake::drake_cache(cache_location))
 ```
 
-blueprintr is a companion to [drake](https://github.com/ropensci/drake) that focuses on documenting and testing tabular data. Whereas drake manages the workflow execution, blueprintr defines a collection of steps that need to be run in a drake workflow.
+`blueprintr` is a framework for managing your data assets in a reproducible fashion. It builds on [drake](https://github.com/ropensci/drake) or [targets](https://cran.r-project.org/web/packages/targets/) for workflow execution and adds automated steps for documenting and testing tabular datasets. This gives researchers a replicable framework that helps keep programming issues from affecting analysis results.
 
-# Basic Use
+## Installation 
 
-The first, and recommended, step is to attach blueprintr to your R session with `library()`.
+```{r setup, results= FALSE}
+# install.packages("remotes")
+# remotes::install_github("nyuglobalties/blueprintr")
 
-```{r setup}
 library(blueprintr)
 ```
 
-In a [drake project](https://books.ropensci.org/drake/projects.html), all packages that you want attached are declared in a `"packages.R"` file. This `library(blueprintr)` command should go there.
+## Designed Use of blueprintr
+`blueprintr` provides your data with guardrails typically found in software engineering workflows. 
+This allows you to test and document your datasets before deploying them to production.
 
-blueprintr is built around "blueprints." Our first blueprint will be a blueprint for `mtcars`:
+The top level of the `blueprintr` workflow is a "blueprints" directory, consisting of `.R` and `.csv` files. 
 
-```{r}
-blueprint(
-  "mtcars_dat",
-  description = "The famous mtcars dataset",
-  command = {
-    mtcars
-  }
-)
-```
+### About blueprints
+Each blueprint has two components to it: 
+* Data Construction Spec, usually a `.R` file that instructs drake or targets on how to build a specific dataset. 
+* Metadata, usually a `.csv` file that incorporates any mapping files and checks that need to be done on the dataset. 
 
-All blueprints have
+To create a blueprint, use the `blueprint()` function. Its main arguments are `name` (the name of your generated dataset), `description` (a description of your dataset), and `command` (the code that builds the dataset).
 
-* A name (the first argument) for the _target_ dataset.
-* A description or brief summary of what the target is. Can be `NULL`.
-* A command, which is a quoted statement that has the code for building this target.
-* A metadata location, which is a path to where the target metadata is saved.
+A project may need only a few blueprints, but more likely you'll need nested blueprints to transform the data. 
 
-<div class="vg-warning"><span>the blueprint name is "mtcars_dat" rather than "mtcars". If the two had the same name, drake would determine that the blueprint has a _circular dependency_ (it depends on itself). To avoid this, blueprints should not have the same names as global variables, like `mtcars`.</span></div>
+`blueprintr` generates six "steps" (targets) per blueprint:
 
-To get this loaded into a drake plan, we need to _attach_ it to an already existing plan using
+Target name             | Description
+------------------------|--------------
+`{blueprint}_initial`   | The result of running the blueprint's `command`
+`{blueprint}_blueprint` | A copy of the blueprint to be used throughout the plan
+`{blueprint}_meta`      | A copy of the dataset metadata --- if the metadata file doesn't exist, it will be created in this step
+`{blueprint}_meta_path` | Creates the metadata file or loads it
+`{blueprint}_checks`    | Runs all tests on the `{blueprint}_initial` target
+`{blueprint}`           | The built dataset after running some cleanup tasks
 
-```r
-attach_blueprint(plan, blueprint)
-attach_blueprints(plan, ...)
-```
+<div class="vg-warning"><span>when writing other targets in your plan, it is advised to **not** refer to the `{blueprint}_initial` step since it could have problems which are discovered in the `{blueprint}_checks` step.</span></div>
 
-<div class="vg-info"><span>`attach_blueprints` accepts "[tidy dots](https://adv-r.hadley.nz/quasiquotation.html#tidy-dots)", so if you have a `list()` of blueprints, you can "splat" all of those blueprints into `attach_blueprints` like `attach_blueprints(plan, !!!list_of_blueprints)`</span></div>
 
-If you don't have an existing plan, you can create one with
+## Example
 
-```r
-plan_from_blueprint(blueprint)
-```
+Let's take a well-known dataset, `mtcars`, and create a blueprint for it.
 
-For now, we'll use an already existing plan, which is probably what you'll have most of the time.
+```{r}
+# Keeping the row names under the column `rn`
+our_mtcars <- mtcars |> tibble::as_tibble(rownames = "rn")
 
-```{r, include=FALSE}
-existing_plan <- drake::drake_plan(initial_vector = runif(1000), squared = initial_vector ^ 2)
+# Inspecting our mtcars dataset
+head(our_mtcars)
 ```
 
+Next, we load a user-created mapping file, which records any changes to variable names and codings.
 ```{r}
-attach_blueprint(
-  existing_plan,
-  blueprint(
-    "mtcars_dat",
-    description = "The famous mtcars dataset",
-    command = {
-      mtcars
-    }
+mapping_file <- system.file("mapping/mtcars_item_mapping.csv", package = "blueprintr", mustWork = TRUE)
+# Read this csv file:
+item_mapping <- mapping_file |>
+  readr::read_csv(
+    col_types = readr::cols(
+      name_1 = readr::col_character(),
+      description_1 = readr::col_character(),
+      coding_1 = readr::col_character(),
+      panel = readr::col_character(),
+      homogenized_name = readr::col_character(),
+      homogenized_coding = readr::col_character(),
+      homogenized_description = readr::col_character()
+    )
   )
-)
+item_mapping
 ```
 
-blueprintr creates five "steps" (targets) per blueprint:
-
-Target name             | Description
-------------------------|--------------
-`{blueprint}_initial`   | The result of running the blueprint's `command`
-`{blueprint}_blueprint` | A copy of the blueprint to be used throughout the plan
-`{blueprint}_meta`      | A copy of the dataset metadata --- if the metadata file doesn't exist, it will be created in this step
-`{blueprint}_checks`    | Runs all checks on the `{blueprint}_initial` target
-`{blueprint}`           | The built dataset after running some cleanup tasks
+Then, we typically use a tool such as `panelcleaner` to attach the mapping file to the `mtcars` dataset.
+This command is executed in the data construction spec.
+```{r}
+blueprint(
+  "mt_cars",
+  description = "mtcars database with attached metadata",
+  annotate = TRUE,
+  command = {
+    pnl <- panelcleaner::enpanel("MTCARS_PANEL", our_mtcars) |>
+      panelcleaner::add_mapping(item_mapping) |>
+      panelcleaner::homogenize_panel() |>
+      panelcleaner::bind_waves() |>
+      as.data.frame()
 
-At this point, you're able to run `drake::make()` on this plan!
+    pnl_name <- get_attr(pnl, "panel_name")
+    pnl_mapping <- get_attr(pnl, "mapping")
 
-<div class="vg-warning"><span>when writing other targets in your plan, it is advised to **not** refer to the `{blueprint}_initial` step since it could have problems which are discovered in the `{blueprint}_checks` step.</span></div>
+    pnl <-
+      pnl
 
-```{r delete_cache, include=FALSE}
-unlink(cache_location, recursive = TRUE)
+    class(pnl) <- c("mapped_df", class(pnl))
+    set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name)
+  }
+) |>
+  bp_include_panelcleaner_meta()
 ```
-
+When this code runs under either `targets` or `drake`, the blueprint metadata is created automatically.
+For our `mtcars` example, the metadata looks like this:
+```{r, echo= FALSE}
+mtcars_metadata <- system.file("project/blueprints/example/homogenized.csv", package = "blueprintr", mustWork = TRUE)
+# Read this csv file:
+mtcars_metadata |>
+  readr::read_csv()
+```
+Manually editing the metadata lets you add tests that check each variable's data type and values.
+And there you have it! You have created your first blueprint on the `mtcars` dataset.
+When you run a pipeline with `blueprintr`, the checks warn researchers of any issues at an early stage,
+helping them produce replicable results.
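
Note on the mapping file added in this commit: the sketch below illustrates, in base R only, the kind of renaming that `name_1` and `homogenized_name` describe (for example `rn` becomes `name`). It is an assumption-laden illustration that does not call `blueprintr` or `panelcleaner`, and it does not claim to reproduce what `panelcleaner::homogenize_panel()` does internally. The three-row excerpt and the `lookup` and `cars` names are made up for the example.

```r
# Illustrative sketch: base R only, no blueprintr or panelcleaner calls.
# A three-row excerpt of the mapping, restricted to the two name columns.
mapping <- read.csv(
  text = '"name_1","homogenized_name"
"rn","name"
"mpg","mpg"
"cyl","cyl"',
  stringsAsFactors = FALSE
)

# Keep the row names in a column called `rn`, as the vignette does.
cars <- data.frame(
  rn = rownames(mtcars),
  mtcars[, c("mpg", "cyl")],
  row.names = NULL
)

# Look up each original column name and swap in its homogenized name.
lookup <- setNames(mapping$homogenized_name, mapping$name_1)
names(cars) <- unname(lookup[names(cars)])

names(cars)
```

After the rename, the first column is called `name` while `mpg` and `cyl` keep their names, which matches the first rows of `inst/project/blueprints/example/homogenized.csv`.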