diff --git a/NEWS.md b/NEWS.md index 12c7cd4..389b8c4 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,3 +1,7 @@ +# blueprintr 0.2.5.9000 (dev version) +* Updated the vignettes +* Added a new folder under `inst` to add metadata to the vignettes + # blueprintr 0.2.5 * Add capability to embed custom messages to check results, using `check.errors` attribute in returned logical value * Refactor side-effect messages from built-in checks to `check.errors` diff --git a/inst/mapping/mtcars_item_mapping.csv b/inst/mapping/mtcars_item_mapping.csv new file mode 100644 index 0000000..6732afb --- /dev/null +++ b/inst/mapping/mtcars_item_mapping.csv @@ -0,0 +1,13 @@ +"name_1","description_1","coding_1","panel","homogenized_name","homogenized_coding","homogenized_description" +"rn","Name of car","NA","MTCARS_PANEL","name","NA","Name of Car" +"mpg","Miles per gallon","NA","MTCARS_PANEL","mpg","NA","Miles per gallon" +"cyl","Number of cylinders","NA","MTCARS_PANEL","cyl","NA","Number of cylinders" +"disp","Displacement","NA","MTCARS_PANEL","disp","NA","Displacement" +"hp","Gross horsepower","NA","MTCARS_PANEL","hp","NA","Gross horsepower" +"drat","Rear axle ratio","NA","MTCARS_PANEL","drat","NA","Rear axle ratio" +"wt","Weight","NA","MTCARS_PANEL","wt","NA","Weight" +"qsec","Quarter mile time","NA","MTCARS_PANEL","qsec","NA","Quarter mile time" +"vs","Engine","coding(code(""1"",""1""), code(""0"", ""0""))","MTCARS_PANEL","vs","coding(code(""1"",""straight""), code(""0"", ""v-shaped""))","Engine" +"am","Transmission","coding(code(""1"",""1""), code(""0"", ""0""))","MTCARS_PANEL","am","coding(code(""1"",""manual""), code(""0"", ""automatic""))","Transmission" +"gear","Number of forward gears","NA","MTCARS_PANEL","gear","NA","Number of forward gears" +"carb","Number of carburetors","NA","MTCARS_PANEL","carb","NA","Number of carburetors" diff --git a/inst/project/blueprints/example/homogenized.csv b/inst/project/blueprints/example/homogenized.csv new file mode 100644 index 0000000..4eb0be9 --- 
/dev/null +++ b/inst/project/blueprints/example/homogenized.csv @@ -0,0 +1,14 @@ +"name","type","description","coding" +"name","character","Name of Car", +"mpg","double","Miles per gallon", +"cyl","double","Number of cylinders", +"disp","double","Displacement", +"hp","double","Gross horsepower", +"drat","double","Rear axle ratio", +"wt","double","Weight", +"qsec","double","Quarter mile time", +"vs","character","Engine","coding(code(""straight"",""1""), code(""v-shaped"",""0""))" +"am","character","Transmission","coding(code(""manual"",""1""), code(""automatic"",""0""))" +"gear","double","Number of forward gears", +"carb","double","Number of carburetors", +"wave","character",, diff --git a/vignettes/blueprintr.Rmd b/vignettes/blueprintr.Rmd index 7696bd4..d27b31f 100644 --- a/vignettes/blueprintr.Rmd +++ b/vignettes/blueprintr.Rmd @@ -1,5 +1,5 @@ --- -title: "A Walkthrough of blueprintr" +title: "Introduction to blueprintr" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{A Walkthrough of blueprintr} @@ -40,88 +40,112 @@ cache_location <- tempdir() drake::clean(cache = drake::drake_cache(cache_location)) ``` -blueprintr is a companion to [drake](https://github.com/ropensci/drake) that focuses on documenting and testing tabular data. Whereas drake manages the workflow execution, blueprintr defines a collection of steps that need to be run in a drake workflow. +`blueprintr` is a framework for managing your data assets in a reproducible fashion. While it uses [drake](https://github.com/ropensci/drake) or [targets](https://cran.r-project.org/web/packages/targets/), it adds automated steps for tabular dataset documentation and testing. This allows researchers to create a replicable framework to prevent programming issues from affecting analysis results. -# Basic Use +## Installation -The first, and recommended, step is to attach blueprintr to your R session with `library()`. 
+```{r setup, results= FALSE} +# install.packages("remotes") +# remotes::install_github("nyuglobalties/blueprintr") -```{r setup} library(blueprintr) ``` -In a [drake project](https://books.ropensci.org/drake/projects.html), all packages that you want attached are declared in a `"packages.R"` file. This `library(blueprintr)` command should go there. +## Designed Use of blueprintr +`blueprintr` provides your data with guardrails typically found in software engineering workflows. +This allows you to test and document before deploying to production. -blueprintr is built around "blueprints." Our first blueprint will be a blueprint for `mtcars`: +The top level of the `blueprintr` workflow is a "blueprints" directory, consisting of `.R` and `.csv` files. -```{r} -blueprint( - "mtcars_dat", - description = "The famous mtcars dataset", - command = { - mtcars - } -) -``` +### About blueprints +Each blueprint has two components to it: +* Data Construction Spec, usually a `.R` file that instructs drake or targets on how to build a specific dataset. +* Metadata, usually a `.csv` file that incorporates any mapping files and checks that need to be done on the dataset. -All blueprints have +In order to create a blueprint, we use the `blueprint` function. This function takes three arguments: name (the name of your generated dataset), description (a description of your dataset), command (any functions that need to be applied in order to build the dataset). -* A name (the first argument) for the _target_ dataset. -* A description or brief summary of what the target is. Can be `NULL`. -* A command, which is a quoted statement that has the code for building this target. -* A metadata location, which is a path to where the target metadata is saved. +A project may need only a few blueprints, but more likely you'll need nested blueprints to transform the data. -
the blueprint name is "mtcars_dat" rather than "mtcars". If the two had the same name, drake would determine that the blueprint has a _circular dependency_ (it depends on itself). To avoid this, blueprints should not have the same names as global variables, like `mtcars`.
+`blueprintr` generates six "steps" (targets) per blueprint: -To get this loaded into a drake plan, we need to _attach_ it to an already existing plan using +Target name | Description +------------------------|-------------- +`{blueprint}_initial` | The result of running the blueprint's `command` +`{blueprint}_blueprint` | A copy of the blueprint to be used throughout the plan +`{blueprint}_meta` | A copy of the dataset metadata --- if the metadata file doesn't exist, it will be created in this step +`{blueprint}_meta_path` | Creates the metadata file or loads it +`{blueprint}_checks` | Runs all tests on the `{blueprint}_initial` target +`{blueprint}` | The built dataset after running some cleanup tasks -```r -attach_blueprint(plan, blueprint) -attach_blueprints(plan, ...) -``` +
When writing other targets in your plan, do **not** refer to the `{blueprint}_initial` step, since it may contain problems that are only discovered in the `{blueprint}_checks` step. Refer to the final `{blueprint}` target instead.
-
`attach_blueprints` accepts "[tidy dots](https://adv-r.hadley.nz/quasiquotation.html#tidy-dots)", so if you have a `list()` of blueprints, you can "splat" all of those blueprints into `attach_blueprints` like `attach_blueprints(plan, !!!list_of_blueprints)`
 -If you don't have an existing plan, you can create one with +## Example -```r -plan_from_blueprint(blueprint) -``` +Let's take a well-known dataset, `mtcars`, and create a blueprint for it. -For now, we'll use an already existing plan, which is probably what you'll have most of the time. +```{r} +# Keeping the row names under the column `rn` +our_mtcars <- mtcars |> tibble::as_tibble(rownames = "rn") -```{r, include=FALSE} -existing_plan <- drake::drake_plan(initial_vector = runif(1000), squared = initial_vector ^ 2) +# Inspecting our mtcars dataset +head(our_mtcars) ``` +Next, we load a user-created mapping file. This file records any changes to variable names and to codings. ```{r} -attach_blueprint( - existing_plan, - blueprint( - "mtcars_dat", - description = "The famous mtcars dataset", - command = { - mtcars - } +mapping_file <- system.file("mapping/mtcars_item_mapping.csv", package = "blueprintr", mustWork = TRUE) +# Read this csv file: +item_mapping <- mapping_file |> + readr::read_csv( + col_types = readr::cols( + name_1 = readr::col_character(), + description_1 = readr::col_character(), + coding_1 = readr::col_character(), + panel = readr::col_character(), + homogenized_name = readr::col_character(), + homogenized_coding = readr::col_character(), + homogenized_description = readr::col_character() + ) ) -) +item_mapping ``` -blueprintr creates five "steps" (targets) per blueprint: - -Target name | Description -------------------------|-------------- -`{blueprint}_initial` | The result of running the blueprint's `command` -`{blueprint}_blueprint` | A copy of the blueprint to be used throughout the plan -`{blueprint}_meta` | A copy of the dataset metadata --- if the metadata file doesn't exist, it will be created in this step -`{blueprint}_checks` | Runs all checks on the `{blueprint}_initial` target -`{blueprint}` | The built dataset after running some cleanup tasks +Then, we typically use a tool such as `panelcleaner` to attach
our mapping file to the `mtcars` dataset. +This step runs as the `command` in the data construction spec. +```{r} +blueprint( + "mt_cars", + description = "mtcars dataset with attached metadata", + annotate = TRUE, + command = { + pnl <- panelcleaner::enpanel("MTCARS_PANEL", our_mtcars) |> + panelcleaner::add_mapping(item_mapping) |> + panelcleaner::homogenize_panel() |> + panelcleaner::bind_waves() |> + as.data.frame() -At this point, you're able to run `drake::make()` on this plan! + pnl_name <- get_attr(pnl, "panel_name") + pnl_mapping <- get_attr(pnl, "mapping") -
when writing other targets in your plan, it is advised to **not** refer to the `{blueprint}_initial` step since it could have problems which are discovered in the `{blueprint}_checks` step.
+ pnl <- + pnl -```{r delete_cache, include=FALSE} -unlink(cache_location, recursive = TRUE) + class(pnl) <- c("mapped_df", class(pnl)) + set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name) + } +) |> + bp_include_panelcleaner_meta() ``` - +When running this code with either `targets` or `drake`, the blueprint metadata is automatically created. +For our mtcars example, this looks like: +```{r, echo= FALSE} +mtcars_metadata <- system.file("project/blueprints/example/homogenized.csv", package = "blueprintr", mustWork = TRUE) +# Read this csv file: +mtcars_metadata |> + readr::read_csv() +``` +Manually editing the metadata allows the user to add tests to check the data type and values. +And there you have it! You have created your first blueprint on the `mtcars` dataset. +When running a pipeline with `blueprintr`, the checks allow researchers to be warned of any issues at an early stage, +allowing them to produce replicable results. \ No newline at end of file
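The target table added in this diff implies a simple naming pattern: each blueprint name is combined with a fixed set of suffixes. Below is a minimal sketch of that pattern, assuming the suffixes are exactly those listed in the table and using the example's blueprint name `mt_cars`. It is plain base R for illustration only and does not call `blueprintr` or reflect how the package builds its targets internally.

```r
# Illustration only: rebuilds the target names from the vignette's table
# with base R string handling. It does not call blueprintr.
blueprint_name <- "mt_cars"  # name used in the vignette's example blueprint

# Suffixes as listed in the "six steps" table; "" is the final built dataset.
suffixes <- c("_initial", "_blueprint", "_meta", "_meta_path", "_checks", "")

target_names <- paste0(blueprint_name, suffixes)
print(target_names)

# Per the vignette's note, downstream targets should depend on the final
# dataset (no suffix) rather than on the unchecked "_initial" step.
final_target <- target_names[suffixes == ""]
```

Listing the names this way makes it easy to see which targets a plan will gain for each blueprint, and which one (`mt_cars`) other targets should reference.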