diff --git a/ch-snippets.Rmd b/ch-snippets.Rmd
index 478633a..89ea32a 100644
--- a/ch-snippets.Rmd
+++ b/ch-snippets.Rmd
@@ -110,6 +110,106 @@ ds <- vroom::vroom(
 rm(col_types)
 ```
+
+Row Operations {#snippets-row}
+------------------------------------
+
+We frequently have to find the mean or sum across columns (within a row).
+If
+Finding mean across a lot of columns
+
+Here are several approaches for finding the mean across columns, without naming each column. Some remarks:
+
+* `m1` & `m2` are sanity checks for this example.
+  `m1` would be clumsy if you have 10+ items.
+  `m2` is discouraged because it's brittle.
+  A change in the column order could alter the calculation.
+  We prefer to use `grep()` to specify a sequence of items.
+* Especially for large datasets,
+  I’d lean towards `m3` if the items are reasonably complete and
+  `m4` if some participants are missing enough items that their summary score is fishy.
+  In the approaches below, `m4` and `m6` return the mean only if the participant completed 2 or more items.
+* `dplyr::rowwise()` is convenient, but slow for large datasets.
+* If you need a more complex function that’s too clumsy to include directly in a `mutate()` statement,
+  see how the calculation for `m6` is delegated to the external function, `f6`.
+* The technique behind `nonmissing` is pretty cool,
+  because you can apply an arbitrary function on each cell before they’re summed/averaged.
+* This is in contrast to `f6()`, which applies to an entire (row-wise) data.frame.
+
+```r
+# Isolate the columns to average. Remember the `grep()` approach w/ `colnames()`
+columns_to_average <- c("hp", "drat", "wt")
+
+f6 <- function(x) {
+  # browser()
+  s <- sum(x, na.rm = TRUE)
+  n <- sum(!is.na(x))
+
+  dplyr::if_else(
+    2L <= n,
+    s / n,
+    NA_real_
+  )
+}
+
+mtcars |>
+  dplyr::mutate(
+    m1 = (hp + drat + wt) / 3,
+    m2 =
+      rowMeans(
+        dplyr::across(hp:wt), # All columns between hp & wt.
+        na.rm = TRUE
+      ),
+    m3 =
+      rowMeans(
+        dplyr::across(!!columns_to_average),
+        na.rm = TRUE
+      ),
+    s4 = # Finding the sum (used by m4)
+      rowSums(
+        dplyr::across(!!columns_to_average),
+        na.rm = TRUE
+      ),
+    nonmissing =
+      rowSums(
+        dplyr::across(
+          !!columns_to_average,
+          .fns = \(x) { !is.na(x) }
+        )
+      ),
+    m4 =
+      dplyr::if_else(
+        2 <= nonmissing,
+        s4 / nonmissing,
+        NA_real_
+      )
+  ) |>
+  dplyr::rowwise() |> # Required for `m5`
+  dplyr::mutate(
+    m5 = mean(dplyr::c_across(dplyr::all_of(columns_to_average))),
+  ) |>
+  dplyr::ungroup() |> # Clean up after rowwise()
+  dplyr::rowwise() |> # Required for `m6`
+  dplyr::mutate(
+    m6 = f6(dplyr::across(!!columns_to_average))
+  ) |>
+  dplyr::ungroup() |> # Clean up after rowwise()
+  dplyr::select(
+    hp,
+    drat,
+    wt,
+    m1,
+    m2,
+    m3,
+    s4,
+    nonmissing,
+    m4,
+    m5,
+    m6,
+  )
+```
+
+
 Grooming {#snippets-grooming}
+------------------------------------
+
diff --git a/docs/ch-snippets.md b/docs/ch-snippets.md
index 478633a..89ea32a 100644
--- a/docs/ch-snippets.md
+++ b/docs/ch-snippets.md
@@ -110,6 +110,106 @@ ds <- vroom::vroom(
 rm(col_types)
 ```
+
+Row Operations {#snippets-row}
+------------------------------------
+
+We frequently have to find the mean or sum across columns (within a row).
+If
+Finding mean across a lot of columns
+
+Here are several approaches for finding the mean across columns, without naming each column. Some remarks:
+
+* `m1` & `m2` are sanity checks for this example.
+  `m1` would be clumsy if you have 10+ items.
+  `m2` is discouraged because it's brittle.
+  A change in the column order could alter the calculation.
+  We prefer to use `grep()` to specify a sequence of items.
+* Especially for large datasets,
+  I’d lean towards `m3` if the items are reasonably complete and
+  `m4` if some participants are missing enough items that their summary score is fishy.
+  In the approaches below, `m4` and `m6` return the mean only if the participant completed 2 or more items.
+* `dplyr::rowwise()` is convenient, but slow for large datasets.
+* If you need a more complex function that’s too clumsy to include directly in a `mutate()` statement,
+  see how the calculation for `m6` is delegated to the external function, `f6`.
+* The technique behind `nonmissing` is pretty cool,
+  because you can apply an arbitrary function on each cell before they’re summed/averaged.
+* This is in contrast to `f6()`, which applies to an entire (row-wise) data.frame.
+
+```r
+# Isolate the columns to average. Remember the `grep()` approach w/ `colnames()`
+columns_to_average <- c("hp", "drat", "wt")
+
+f6 <- function(x) {
+  # browser()
+  s <- sum(x, na.rm = TRUE)
+  n <- sum(!is.na(x))
+
+  dplyr::if_else(
+    2L <= n,
+    s / n,
+    NA_real_
+  )
+}
+
+mtcars |>
+  dplyr::mutate(
+    m1 = (hp + drat + wt) / 3,
+    m2 =
+      rowMeans(
+        dplyr::across(hp:wt), # All columns between hp & wt.
+        na.rm = TRUE
+      ),
+    m3 =
+      rowMeans(
+        dplyr::across(!!columns_to_average),
+        na.rm = TRUE
+      ),
+    s4 = # Finding the sum (used by m4)
+      rowSums(
+        dplyr::across(!!columns_to_average),
+        na.rm = TRUE
+      ),
+    nonmissing =
+      rowSums(
+        dplyr::across(
+          !!columns_to_average,
+          .fns = \(x) { !is.na(x) }
+        )
+      ),
+    m4 =
+      dplyr::if_else(
+        2 <= nonmissing,
+        s4 / nonmissing,
+        NA_real_
+      )
+  ) |>
+  dplyr::rowwise() |> # Required for `m5`
+  dplyr::mutate(
+    m5 = mean(dplyr::c_across(dplyr::all_of(columns_to_average))),
+  ) |>
+  dplyr::ungroup() |> # Clean up after rowwise()
+  dplyr::rowwise() |> # Required for `m6`
+  dplyr::mutate(
+    m6 = f6(dplyr::across(!!columns_to_average))
+  ) |>
+  dplyr::ungroup() |> # Clean up after rowwise()
+  dplyr::select(
+    hp,
+    drat,
+    wt,
+    m1,
+    m2,
+    m3,
+    s4,
+    nonmissing,
+    m4,
+    m5,
+    m6,
+  )
+```
+
+
 Grooming {#snippets-grooming}
+------------------------------------
+
diff --git a/docs/example-chapter.html b/docs/example-chapter.html
index a672664..8333706 100644
--- a/docs/example-chapter.html
+++ b/docs/example-chapter.html
@@ -113,7 +113,7 @@

This intro was copied from the 1st chapter of the example bookdown repo. I’m keeping it temporarily for reference.

You can label chapter and section titles using {#label} after them, e.g., we can reference the Intro Chapter. If you do not manually label them, there will be automatic labels anyway

Figures and tables with captions will be placed in figure and table environments, respectively.

-
+
 par(mar = c(4, 4, .1, .1))
 plot(pressure, type = 'b', pch = 19)
@@ -123,7 +123,7 @@

Reference a figure by its code chunk label with the fig: prefix, e.g., see Figure G.1. Similarly, you can reference tables generated from knitr::kable(), e.g., see Table G.1.

-
+
 knitr::kable(
   head(iris, 20), caption = 'Here is a nice table!',
   booktabs = TRUE
diff --git a/docs/reference-keys.txt b/docs/reference-keys.txt
index edbc646..db31210 100644
--- a/docs/reference-keys.txt
+++ b/docs/reference-keys.txt
@@ -275,6 +275,7 @@ snippets-reading
 snippets-reading-excel
 snippets-reading-trailing-comma
 snippets-reading-vroom
+snippets-row
 snippets-grooming
 snippets-grooming-two-year
 snippets-identification
diff --git a/docs/search.json b/docs/search.json
index 30992bb..1c0b91b 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -1 +1 @@
-[{"path":"index.html","id":"intro","chapter":"1 Introduction","heading":"1 Introduction","text":"collection documents describe practices used OUHSC BBMC analytics projects.","code":""},{"path":"coding.html","id":"coding","chapter":"2 Coding Principles","heading":"2 Coding Principles","text":"","code":""},{"path":"coding.html","id":"coding-simplify","chapter":"2 Coding Principles","heading":"2.1 Simplify","text":"","code":""},{"path":"coding.html","id":"coding-simplify-types","chapter":"2 Coding Principles","heading":"2.1.1 Data Types","text":"Use simplest data type reasonable. simpler data type less likely contain unintended values. seen, string variable called gender can simultaneously contain values “m”, “f”, “F”, “Female”, “MALE”, “0”, “1”, “2”, “Latino”, ““, NA. hand, boolean variable gender_male can FALSE, TRUE, NA.1SQLite dedicated datatype, must resort storing 0, 1 NULL values. caller can’t assume ostensible boolean SQLite variable contains three values, variable checked.cleaned variable initial ETL files (like Ellis), establish boundaries spend time downstream files verifying bad values introduced. small bonus, simpler data types typically faster, consume less memory, translate cleanly across platforms.Within R, numeric-ish variables can represented following four data types. Use simplest type adequately captures information. logical simplest numeric flexible.logical/boolean/bit,integer,bit64::integer64, andnumeric/double-precision floats.Categorical variables similar spectrum. logical types, factors restrictive less flexible characters.2logical/boolean/bit,factor, andcharacter.","code":""},{"path":"coding.html","id":"coding-simplify-categorical","chapter":"2 Coding Principles","heading":"2.1.2 Categorical Levels","text":"boolean variable restrictive factor character required, choose simplest representation. 
possible:Use lower case (e.g., ‘male’ instead ‘Male’ gender variable).avoid repeating variable level (e.g., ‘control’ instead ‘control condition’ condition variable).","code":""},{"path":"coding.html","id":"coding-simplify-recoding","chapter":"2 Coding Principles","heading":"2.1.3 Recoding","text":"Almost every project recodes variables. Choose simplest function possible. functions top easier read harder mess functions itLeverage existing booleans: Suppose logical variable gender_male (can TRUE, FALSE, NA). Writing gender_male == TRUE gender_male == FALSE evaluate boolean –’s unnecessary gender_male already boolean.\nTesting TRUE: use variable (.e., gender_male instead gender_male == TRUE).\nTesting FALSE: use !. Write !gender_male instead gender_male == FALSE gender_male != TRUE.\nLeverage existing booleans: Suppose logical variable gender_male (can TRUE, FALSE, NA). Writing gender_male == TRUE gender_male == FALSE evaluate boolean –’s unnecessary gender_male already boolean.Testing TRUE: use variable (.e., gender_male instead gender_male == TRUE).Testing TRUE: use variable (.e., gender_male instead gender_male == TRUE).Testing FALSE: use !. Write !gender_male instead gender_male == FALSE gender_male != TRUE.Testing FALSE: use !. 
Write !gender_male instead gender_male == FALSE gender_male != TRUE.dplyr::coalesce(): function evaluates single variable replaces NA values another variable.\ncoalesce like\n\nvisit_completed <- dplyr::coalesce(visit_completed, FALSE)\nmuch easier read mess \n\nvisit_completed <- dplyr::if_else(!.na(visit_completed), visit_completed, FALSE)dplyr::coalesce(): function evaluates single variable replaces NA values another variable.coalesce likeis much easier read mess thandplyr::na_if() transforms nonmissing value NA.\nRecoding missing values like\n\nbirth_apgar <- dplyr::na_if(birth_apgar, 99)\neasier read mess \n\nbirth_apgar <- dplyr::if_else(birth_apgar == 99, NA_real_, birth_apgar)dplyr::na_if() transforms nonmissing value NA.Recoding missing values likeis easier read mess <= (similar comparison operator): Compare two quantities output boolean variable. parentheses unnecessary, can help readability. either value NA, result NA.\nNotice prefer order variables like number line. result TRUE, smaller value left larger value.\n\ndob_in_the_future   <- (Sys.Date() < dob)\ndod_follows_dob     <- (dob <= dod)\npremature           <- (gestation_weeks < 37)\nbig_boy             <- (threshold_in_kg <= birth_weight_in_kg)<= (similar comparison operator): Compare two quantities output boolean variable. parentheses unnecessary, can help readability. either value NA, result NA.Notice prefer order variables like number line. result TRUE, smaller value left larger value.dplyr::if_else(): function evaluates single boolean variable expression. output branches three possibilities: input () true, (b) false, (c) (optionally) NA. 
Notice unlike <= operator, dplyr::if_else() lets specify value input expression evaluates NA.\n\ndate_start  <- .Date(\"2017-01-01\")\n\n# missing month element needs handled explicitly.\nstage       <- dplyr::if_else(date_start <= month, \"post\", \"pre\", missing = \"missing-month\")\n\n# Otherwise simple boolean output sufficient.\nstage_post  <- (date_start <= month)\nimportant reader understand input expression NA produce NA, consider using dplyr::if_else(). Even though two lines equivalent, casual reader may consider stage_post NA.\n\nstage_post  <- (date_start <= month)\nstage_post  <- dplyr::if_else(date_start <= month, TRUE, FALSE, missing = NA)dplyr::if_else(): function evaluates single boolean variable expression. output branches three possibilities: input () true, (b) false, (c) (optionally) NA. Notice unlike <= operator, dplyr::if_else() lets specify value input expression evaluates NA.important reader understand input expression NA produce NA, consider using dplyr::if_else(). Even though two lines equivalent, casual reader may consider stage_post NA.dplyr::(): function evaluates numeric x left right boundary return boolean value. output TRUE x inside boundaries equal either boundary (.e., boundaries inclusive). output FALSE x outside either boundary.\n\ntoo_cold      <- 60\ntoo_hot       <- 88\ngoldilocks_1  <- dplyr::(temperature, too_cold, too_hot)\n\n# equivalent previous line.\ngoldilocks_2  <- (too_cold <= temperature & temperature <= too_hot)\nneed exclusive boundary, abandon dplyr::() specify exactly.\n\n# Left boundary exclusive\ngoldilocks_3  <- (too_cold < temperature & temperature <= too_hot)\n\n# boundaries exclusive\ngoldilocks_4  <- (too_cold < temperature & temperature <  too_hot)\ncode starts nest dplyr::() calls inside dplyr::if_else(), consider base::cut().dplyr::(): function evaluates numeric x left right boundary return boolean value. output TRUE x inside boundaries equal either boundary (.e., boundaries inclusive). 
output FALSE x outside either boundary.need exclusive boundary, abandon dplyr::() specify exactly.code starts nest dplyr::() calls inside dplyr::if_else(), consider base::cut().base::cut(): function transforms single numeric variable factor. range cut different segments/categories one-dimensional number line. output branches single discrete value (either factor-level integer). Modify right parameter FALSE ’d like left/lower bound inclusive (tends natural ).\n\nmtcars |>\n  tibble::as_tibble() |>\n  dplyr::select(\n    disp,\n  ) |>\n  dplyr::mutate(\n    # Example simple inequality operator (see two bullets )\n    muscle_car            = (300 <= disp),\n\n    # Divide `disp` three levels.\n    size_default_labels   = cut(disp, breaks = c(-Inf, 200, 300, Inf), right = F),\n\n    # Divide `disp` three levels custom labels.\n    size_cut3             = cut(\n      disp,\n      breaks = c(-Inf,   200,      300,   Inf),\n      labels = c(  \"small\", \"medium\", \"big\"),\n      right = FALSE  # right boundary INclusive ('FALSE' EXclusive boundary)\n    ),\n\n    # Divide `disp` five levels custom labels.\n    size_cut5             = cut(\n      disp,\n      breaks = c(-Inf,         100,            150,            200,      300,   Inf),\n      labels = c(  \"small small\", \"medium small\", \"biggie small\", \"medium\", \"big\"),\n      right = FALSE\n    ),\n  )base::cut(): function transforms single numeric variable factor. range cut different segments/categories one-dimensional number line. output branches single discrete value (either factor-level integer). Modify right parameter FALSE ’d like left/lower bound inclusive (tends natural ).dplyr::recode(): function accepts integer character variable. output branches single discrete value. 
example maps integers strings.\n\n# https://www.census.gov/quickfacts/fact/note/US/RHI625219\nrace_id        <- c(1L, 2L, 1L, 4L, 3L, 4L, 2L, NA_integer_)\nrace_id_spouse <- c(1L, 1L, 2L, 3L, 3L, 4L, 5L, NA_integer_)\nrace <-\n  dplyr::recode(\n    race_id,\n    \"1\"      = \"White\",\n    \"2\"      = \"Black African American\",\n    \"3\"      = \"American Indian Alaska Native\",\n    \"4\"      = \"Asian\",\n    \"5\"      = \"Native Hawaiian Pacific Islander\",\n    .missing = \"Unknown\"\n  )\nmultiple variables mapping, define mapping named vector, pass multiple calls dplyr::recode(). Notice two variables race race_spouse use mapping.3\n\nmapping_race <- c(\n  \"1\" = \"White\",\n  \"2\" = \"Black African American\",\n  \"3\" = \"American Indian Alaska Native\",\n  \"4\" = \"Asian\",\n  \"5\" = \"Native Hawaiian Pacific Islander\"\n)\nrace <-\n  dplyr::recode(\n    race_id,\n    !!!mapping_race,\n    .missing = \"Unknown\"\n  )\nrace_spouse <-\n  dplyr::recode(\n    race_id_spouse,\n    !!!mapping_race,\n    .missing = \"Unknown\"\n  )\nTips dplyr::recode():\nreusable dedicated mapping vector useful surveys 10+ Likert items consistent levels like “disagree”, “neutral”, “agree”.\nUse dplyr::recode_factor() map integers factor levels.\nforcats::fct_recode() similar. prefer .missing parameter dplyr::recode() translates NA explicit value.\nusing REDCap API, functions help convert radio buttons character factor variable.\ndplyr::recode(): function accepts integer character variable. output branches single discrete value. example maps integers strings.multiple variables mapping, define mapping named vector, pass multiple calls dplyr::recode(). Notice two variables race race_spouse use mapping.3Tips dplyr::recode():reusable dedicated mapping vector useful surveys 10+ Likert items consistent levels like “disagree”, “neutral”, “agree”.Use dplyr::recode_factor() map integers factor levels.forcats::fct_recode() similar. 
prefer .missing parameter dplyr::recode() translates NA explicit value.using REDCap API, functions help convert radio buttons character factor variable.lookup table: feasible recode 6 levels race directly R, ’s less feasible recode 200 provider names. Specify mapping csv, use readr convert csv data.frame, finally left join .lookup table: feasible recode 6 levels race directly R, ’s less feasible recode 200 provider names. Specify mapping csv, use readr convert csv data.frame, finally left join .dplyr::case_when(): function complicated can evaluate multiple input variables. Also, multiple cases can true, first output returned. ‘water fall’ execution helps complicated scenarios, overkill .dplyr::case_when(): function complicated can evaluate multiple input variables. Also, multiple cases can true, first output returned. ‘water fall’ execution helps complicated scenarios, overkill .","code":"\nvisit_completed <- dplyr::coalesce(visit_completed, FALSE)\nvisit_completed <- dplyr::if_else(!is.na(visit_completed), visit_completed, FALSE)\nbirth_apgar <- dplyr::na_if(birth_apgar, 99)\nbirth_apgar <- dplyr::if_else(birth_apgar == 99, NA_real_, birth_apgar)\ndob_in_the_future   <- (Sys.Date() < dob)\ndod_follows_dob     <- (dob <= dod)\npremature           <- (gestation_weeks < 37)\nbig_boy             <- (threshold_in_kg <= birth_weight_in_kg)\ndate_start  <- as.Date(\"2017-01-01\")\n\n# If a missing month element needs to be handled explicitly.\nstage       <- dplyr::if_else(date_start <= month, \"post\", \"pre\", missing = \"missing-month\")\n\n# Otherwise a simple boolean output is sufficient.\nstage_post  <- (date_start <= month)\nstage_post  <- (date_start <= month)\nstage_post  <- dplyr::if_else(date_start <= month, TRUE, FALSE, missing = NA)\ntoo_cold      <- 60\ntoo_hot       <- 88\ngoldilocks_1  <- dplyr::between(temperature, too_cold, too_hot)\n\n# This is equivalent to the previous line.\ngoldilocks_2  <- (too_cold <= temperature & temperature <= too_hot)\n# Left 
boundary is exclusive\ngoldilocks_3  <- (too_cold < temperature & temperature <= too_hot)\n\n# Both boundaries are exclusive\ngoldilocks_4  <- (too_cold < temperature & temperature <  too_hot)\nmtcars |>\n  tibble::as_tibble() |>\n  dplyr::select(\n    disp,\n  ) |>\n  dplyr::mutate(\n    # Example of a simple inequality operator (see two bullets above)\n    muscle_car            = (300 <= disp),\n\n    # Divide `disp` into three levels.\n    size_default_labels   = cut(disp, breaks = c(-Inf, 200, 300, Inf), right = F),\n\n    # Divide `disp` into three levels with custom labels.\n    size_cut3             = cut(\n      disp,\n      breaks = c(-Inf,   200,      300,   Inf),\n      labels = c(  \"small\", \"medium\", \"big\"),\n      right = FALSE  # Is the right boundary INclusive ('FALSE' is an EXclusive boundary)\n    ),\n\n    # Divide `disp` into five levels with custom labels.\n    size_cut5             = cut(\n      disp,\n      breaks = c(-Inf,         100,            150,            200,      300,   Inf),\n      labels = c(  \"small small\", \"medium small\", \"biggie small\", \"medium\", \"big\"),\n      right = FALSE\n    ),\n  )\n# https://www.census.gov/quickfacts/fact/note/US/RHI625219\nrace_id        <- c(1L, 2L, 1L, 4L, 3L, 4L, 2L, NA_integer_)\nrace_id_spouse <- c(1L, 1L, 2L, 3L, 3L, 4L, 5L, NA_integer_)\nrace <-\n  dplyr::recode(\n    race_id,\n    \"1\"      = \"White\",\n    \"2\"      = \"Black or African American\",\n    \"3\"      = \"American Indian and Alaska Native\",\n    \"4\"      = \"Asian\",\n    \"5\"      = \"Native Hawaiian or Other Pacific Islander\",\n    .missing = \"Unknown\"\n  )\nmapping_race <- c(\n  \"1\" = \"White\",\n  \"2\" = \"Black or African American\",\n  \"3\" = \"American Indian and Alaska Native\",\n  \"4\" = \"Asian\",\n  \"5\" = \"Native Hawaiian or Other Pacific Islander\"\n)\nrace <-\n  dplyr::recode(\n    race_id,\n    !!!mapping_race,\n    .missing = \"Unknown\"\n  )\nrace_spouse <-\n  dplyr::recode(\n    
race_id_spouse,\n    !!!mapping_race,\n    .missing = \"Unknown\"\n  )"},{"path":"coding.html","id":"coding-defensive","chapter":"2 Coding Principles","heading":"2.2 Defensive Style","text":"","code":""},{"path":"coding.html","id":"coding-defensive-qualify-functions","chapter":"2 Coding Principles","heading":"2.2.1 Qualify functions","text":"Try prepend function package. Write dplyr::filter() instead filter(). two packages contain public functions name, package recently called library() takes precedent. multiple R files executed, packages’ precedents may predictable. Specifying package eliminates ambiguity, also making code easier follow. reason, recommend almost R files contain ‘load-packages’ chunk.See Google Style Guide qualifying functions.exceptions exist, including:sf package ’re using objects dplyr verbs.","code":""},{"path":"coding.html","id":"coding-defensive-date-arithmetic","chapter":"2 Coding Principles","heading":"2.2.2 Date Arithmetic","text":"Don’t use minus operator (.e., -) subtract dates. Instead use .integer(difftime(stop, start, units=\"days\")). ’s longer protects scenario start stop changed upstream date datetime. case, stop - start returns number seconds two points, number days.","code":""},{"path":"coding.html","id":"excluding-bad-cases","chapter":"2 Coding Principles","heading":"2.2.3 Excluding Bad Cases","text":"variables critical record, ’s missing, don’t want trust values. instance, hospital visit record rarely useful null patient ID. cases, prevent record passing ellis.example, ’ll presume trust patient record lacks clean date birth (dob).Define permissible range, either ellis’s declare-globals chunk, config-file. (’ll use config file example.) ’ll exclude anyone born 2000, tomorrow. 
Even though ’s illogical someone retrospective record born tomorrow, consider bending little small errors.\nrange_dob   : !expr c(.Date(\"2000-01-01\"), Sys.Date() + lubridate::days(1))Define permissible range, either ellis’s declare-globals chunk, config-file. (’ll use config file example.) ’ll exclude anyone born 2000, tomorrow. Even though ’s illogical someone retrospective record born tomorrow, consider bending little small errors.tweak-data chunk, use OuhscMunge::trim_date() set cell NA falls outside acceptable range. dplyr::mutate(), call tidyr::drop_na() exclude entire record, regardless () already NA, (b) “trimmed” NA.\n\nds <-\n  ds |>\n  dplyr::mutate(\n    dob = OuhscMunge::trim_date(dob, config$range_dob)\n  ) |>\n  tidyr::drop_na(dob)tweak-data chunk, use OuhscMunge::trim_date() set cell NA falls outside acceptable range. dplyr::mutate(), call tidyr::drop_na() exclude entire record, regardless () already NA, (b) “trimmed” NA.Even though ’s overkill trimming, (eventually) verify variable three reasons: () ’s chance code isn’t working expected, (b) later code might introduced bad values, (c) clearly documents reader dob included range stage pipeline.\n\ncheckmate::assert_date(ds$dob, .missing=F, lower=config$range_dob[1], upper=config$range_dob[2])Even though ’s overkill trimming, (eventually) verify variable three reasons: () ’s chance code isn’t working expected, (b) later code might introduced bad values, (c) clearly documents reader dob included range stage pipeline.","code":"range_dob   : !expr c(as.Date(\"2000-01-01\"), Sys.Date() + lubridate::days(1))\nds <-\n  ds |>\n  dplyr::mutate(\n    dob = OuhscMunge::trim_date(dob, config$range_dob)\n  ) |>\n  tidyr::drop_na(dob)\ncheckmate::assert_date(ds$dob, any.missing=F, lower=config$range_dob[1], upper=config$range_dob[2])"},{"path":"coding.html","id":"throw-errors-for-bad-cells","chapter":"2 Coding Principles","heading":"2.2.4 Throw errors for bad cells","text":"checkmate::assert_*() functions throw 
error stop R’s execution encountering vector violates constraints specified. previous snippet alert ifds$dob date,ds$dob least one NA value, ords$dob value earlier config$range_dob[1] later config$range_dob[2].package family functions accommodate many types vectors. common conditions verify :vector’s values unique, arises ’re upload primary key database (e.g., patient ID patient table),\n\ncheckmate::assert_integer(ds$pt_id, unique = TRUE)vector’s values unique, arises ’re upload primary key database (e.g., patient ID patient table),vector’s string follow strict pattern (e.g., patient ID “” “B”, followed 4 digits)\n\ncheckmate::assert_character(ds$pt_id, pattern = \"^[AB]\\\\d{4}$\")vector’s string follow strict pattern (e.g., patient ID “” “B”, followed 4 digits)database doesn’t accept names longer 50 characters\n\ncheckmate::assert_character(ds$name_first, min.chars = 50)\n# \ncheckmate::assert_character(ds$name_first, pattern = \"^.{0,50}$\")database doesn’t accept names longer 50 charactersThe pattern argument ultimately passed base::grepl(), leverage regular expressions.","code":"\ncheckmate::assert_integer(ds$pt_id, unique = TRUE)\ncheckmate::assert_character(ds$pt_id, pattern = \"^[AB]\\\\d{4}$\")\ncheckmate::assert_character(ds$name_first, min.chars = 50)\n# or\ncheckmate::assert_character(ds$name_first, pattern = \"^.{0,50}$\")"},{"path":"coding.html","id":"throw-errors-for-bad-conditions","chapter":"2 Coding Principles","heading":"2.2.5 Throw errors for bad conditions","text":"Sometimes dataset smells fishy even though single cell violates constraint. Send flare ’s kinda bad, yet stop execution really stinks.especially important recurring scripts process new datasets never inspected human, daily forecast. Even though today’s incoming dataset fine, shouldn’t trust next month’s. worst, lonely test never catches violation (wasted 5 minutes). 
best, catches problem proceeded undetected compromised downstream analyses.following snippet asserts ’s acceptable 2% patients missing age, never get worse 5%. Therefore throws error missingness exceeds 5% throws warning exceeds 2%.","code":"\n# Simulate a vector of ages.\nds <- tibble::tibble(\n  age = sample(c(NA, 1:19), size = 100, replace = TRUE)\n)\n\n# Define thresholds for errors & warnings.\nthreshold_error     <- .05\nthreshold_warning   <- .02\n\n# Calculate proportion of missing cells.\nmissing_proportion  <- mean(is.na(ds$age))\n\n# Accompany the error/warning with an informative message.\nif (threshold_error < missing_proportion) {\n  stop(\n    \"The proportion of missing `age` values is \", missing_proportion,\n    \", but it shouldn't exceed \", threshold_error, \".\"\n  )\n} else if (threshold_warning < missing_proportion) {\n  warning(\n    \"The proportion of missing `age` values is \", missing_proportion,\n    \", but ideally it stays below \", threshold_warning, \".\"\n  )\n}"},{"path":"architecture.html","id":"architecture","chapter":"3 Architecture Principles","heading":"3 Architecture Principles","text":"","code":""},{"path":"architecture.html","id":"encapsulation","chapter":"3 Architecture Principles","heading":"3.1 Encapsulation","text":"","code":""},{"path":"architecture.html","id":"leverage-team-members-strengths-avoid-weaknesses","chapter":"3 Architecture Principles","heading":"3.2 Leverage team member’s strengths & avoid weaknesses","text":"","code":""},{"path":"architecture.html","id":"focused-code-files","chapter":"3 Architecture Principles","heading":"3.2.1 Focused code files","text":"","code":""},{"path":"architecture.html","id":"metadata-for-content-experts","chapter":"3 Architecture Principles","heading":"3.2.2 Metadata for content experts","text":"","code":""},{"path":"architecture.html","id":"scales","chapter":"3 Architecture Principles","heading":"3.3 
Scales","text":"","code":""},{"path":"architecture.html","id":"single-source-single-analysis","chapter":"3 Architecture Principles","heading":"3.3.1 Single source & single analysis","text":"","code":""},{"path":"architecture.html","id":"multiple-sources-multiple-analyses","chapter":"3 Architecture Principles","heading":"3.3.2 Multiple sources & multiple analyses","text":"","code":""},{"path":"architecture.html","id":"architecture-consistency","chapter":"3 Architecture Principles","heading":"3.4 Consistency","text":"","code":""},{"path":"architecture.html","id":"consistency-files","chapter":"3 Architecture Principles","heading":"3.4.1 Across Files","text":"","code":""},{"path":"architecture.html","id":"across-languages","chapter":"3 Architecture Principles","heading":"3.4.2 Across Languages","text":"","code":""},{"path":"architecture.html","id":"across-projects","chapter":"3 Architecture Principles","heading":"3.4.3 Across Projects","text":"","code":""},{"path":"prototype-r.html","id":"prototype-r","chapter":"4 Prototypical R File","heading":"4 Prototypical R File","text":"stated Consistency across Files, using consistent file structure can () improve quality code structure proven time facilitate good practices (b) allow intentions clear teammates familiar order intentions chunks.use term “chunk” section code corresponds knitr terminology (Xie 2015), many analysis files (opposed manipulation files), chunk R file connects knitr Rmd file.","code":""},{"path":"prototype-r.html","id":"chunk-clear","chapter":"4 Prototypical R File","heading":"4.1 Clear Memory","text":"initial chunk many files clear memory variables previous run. important developing debugging prevents previous runs contaminating subsequent runs. However little effect production; ’ll look manipulation files separately analysis files.Manipulation R files sourced argument local=new.env(). file executed fresh environment, variables clear. 
Analysis R files typically called Rmd file’s knitr::read_chunk(), code positioned first chunk called knitr 4.However typically clear memory R files sourced environment caller, interfere caller’s variables.","code":"\nrm(list = ls(all.names = TRUE))"},{"path":"prototype-r.html","id":"chunk-load-sources","chapter":"4 Prototypical R File","heading":"4.2 Load Sources","text":"first true chunk, source R files containing global variables functions current file requires. instance, team statisticians producing large report containing many analysis files, define many graphical elements single file. sourced file defines common color palettes graphical functions cosmetics uniform across analyses.prefer sourced files perform real action, importing data manipulating file. One reason difficult consistent environmental variables sourced file’s functions run. second reason cognitively difficult understand files connected.sourced file contains function definitions, operations can called time current file much tighter control variables modified. bonus discipline defining functions (instead executing functions) operations typically robust generalizable.Keep chunk even files sourced. empty chunk instructive readers trying determine files sourced. applies recommendation applies chunks discussed chapter. always, team agree set standards.","code":"\n# ---- load-sources ------------------------------------------------------------\nbase::source(file=\"./analysis/common/display-1.R\")      # Load common graphing functions."},{"path":"prototype-r.html","id":"chunk-load-packages","chapter":"4 Prototypical R File","heading":"4.3 Load Packages","text":"‘load-packages’ chunk declares required packages near file’s beginning three reasons. First, reader scanning file can quickly determine dependencies located single chunk. Second, machine lacking required package, best know early5. 
Third, style mimics requirement languages (declaring headers top C++ file) follows tidyverse style guide.discussed previous qualify functions section, recommend functions qualified package (e.g., foo::bar() instead merely bar()). Consequently, ‘load-packages’ chunk calls requireNamespace() frequently library(). requireNamespace() verifies package available local machine, load memory; library() verifies package available, loads .requireNamespace() used several scenarios.Core packages (e.g., ‘base’ ‘stats’) loaded R default installations. avoid unnecessary calls like library(stats) distract important features.Obvious dependencies called requireNamespace() library() similar reasons, especially called directly. example ‘tidyselect’ listed ‘tidyr’ listed.using version older R 4.1: “pipe” function (declared ‘magrittr’ package , i.e., %>%) attached import::from(magrittr, \"%>%\"). frequently-used function can called throughout execution without qualification.Compared manipulation files, analysis files tend use many functions concentrated packages conflicting function names less common. Typical packages used analysis ‘ggplot2’ ‘lme4’.sourced files may load packages (calling library()). important library() calls file follow ‘load-sources’ chunk identically-named functions (different packages) called correct precedent. Otherwise identically-named functions conflict namespace hard--predict results.Read R Packages library(), requireNamespace(), siblings, well larger concepts attaching functions search path.packages found manipulation files. Notice lesser-known packages quick explanation; helps maintainers decide declaration still necessary. 
Also notice packages distributed outside CRAN (e.g., GitHub) quick commented line help user install update package.","code":"\n# ---- load-packages -----------------------------------------------------------\n# import::from(magrittr, \"%>%\" )\n\nrequireNamespace(\"readr\"     )\nrequireNamespace(\"tidyr\"     )\nrequireNamespace(\"dplyr\"     )\nrequireNamespace(\"config\"    )\nrequireNamespace(\"checkmate\" ) # Asserts expected conditions\nrequireNamespace(\"OuhscMunge\") # remotes::install_github(repo=\"OuhscBbmc/OuhscMunge\")"},{"path":"prototype-r.html","id":"chunk-declare","chapter":"4 Prototypical R File","heading":"4.4 Declare Globals","text":"values repeatedly used within file, consider dedicating variable ’s defined set . also good place variables used , whose value central file’s mission. Typical variables ‘declare-globals’ chunk include data file paths, data file variables, color palettes, values config file.config file can coordinate static variable across multiple files. Centrally","code":"\n# ---- declare-globals ---------------------------------------------------------\n# Constant values that won't change.\nconfig                         <- config::get()\npath_db                        <- config$path_database\n\n# Execute to specify the column types.  It might require some manual adjustment (eg doubles to integers).\n#   OuhscMunge::readr_spec_aligned(config$path_subject_1_raw)\ncol_types <- readr::cols_only(\n  subject_id          = readr::col_integer(),\n  county_id           = readr::col_integer(),\n  gender_id           = readr::col_double(),\n  race                = readr::col_character(),\n  ethnicity           = readr::col_character()\n)"},{"path":"prototype-r.html","id":"chunk-load-data","chapter":"4 Prototypical R File","heading":"4.5 Load Data","text":"data ingested file occurs chunk. like think file linear pipe single point input single point output. 
Although possible file read data files line, recommend avoiding sprawl difficult humans understand. software developer deist watchmaker, file’s fate sealed end chunk. makes easier human reason isolate problems either existing (a) incoming data (b) calculations data.Ideally chunk consumes data either plain-text csv database.Many capable R functions packages ingest data. prefer tidyverse readr reading conventional files; younger cousin, vroom nice advantages working larger files forms jagged rectangles. Depending file format, good packages consider data.table, haven, readxl, openxlsx, arrow, jsonlite, fst, yaml, rio.used Ellis, chunk likely consumes flat file like csv data metadata. used Ferry, Arch, Scribe, chunk likely consumes database table. used Analysis file, chunk likely consumes database table rds (i.e., compressed R data file).large-scale scenarios, may series datasets held RAM simultaneously. first choice split R file new file subset datasets –words, R file probably given much responsibility. Occasionally multiple datasets need considered , splitting R file option. scenarios, prefer upload datasets database, better manipulating datasets large RAM.R solution may loosen restriction dataset enter R file ‘load-data’ chunk. dataset processed longer needed, rm() removes RAM. Now another dataset can read file manipulated.loose scrap:\nchunk reads data (e.g., database table, networked CSV, local lookup table). chunk, new data introduced. sake reducing human cognition load. 
Everything chunk derived first four chunks.","code":""},{"path":"prototype-r.html","id":"chunk-tweak-data","chapter":"4 Prototypical R File","heading":"4.6 Tweak Data","text":"loose scrap:\n’s best rename dataset () single place (b) early pipeline, bad variable never referenced.","code":"\n# OuhscMunge::column_rename_headstart(ds) # Help write `dplyr::select()` call.\nds <-\n  ds |>\n  dplyr::select(    # `dplyr::select()` drops columns not included.\n    subject_id,\n    county_id,\n    gender_id,\n    race,\n    ethnicity\n  ) |>\n  dplyr::mutate(\n\n  ) |>\n  dplyr::arrange(subject_id) # |>\n  # tibble::rowid_to_column(\"subject_id\") # Add a unique index if necessary"},{"path":"prototype-r.html","id":"chunk-unique","chapter":"4 Prototypical R File","heading":"4.7 (Unique Content)","text":"section represents chunks tweak-data verify-values. chunks contain file’s creativity contribution. sense, structure first last chunks allow middle chunks focus concepts instead plumbing.simple files like ellis metadata file, may even need anything . complex analysis files may 200+ lines distributed across dozen chunks. recommend create dedicate chunk conceptual stage. 
one starts contain ~20 lines, consider granular organization clarify code’s intent.","code":""},{"path":"prototype-r.html","id":"chunk-verify-values","chapter":"4 Prototypical R File","heading":"4.8 Verify Values","text":"Running OuhscMunge::verify_value_headstart(ds) ","code":"\n# ---- verify-values -----------------------------------------------------------\n# Sniff out problems\n# OuhscMunge::verify_value_headstart(ds)\ncheckmate::assert_integer(  ds$county_month_id    , any.missing=F , lower=1, upper=3080                , unique=T)\ncheckmate::assert_integer(  ds$county_id          , any.missing=F , lower=1, upper=77                            )\ncheckmate::assert_date(     ds$month              , any.missing=F , lower=as.Date(\"2012-06-15\"), upper=Sys.Date())\ncheckmate::assert_character(ds$county_name        , any.missing=F , pattern=\"^.{3,12}$\"                          )\ncheckmate::assert_integer(  ds$region_id          , any.missing=F , lower=1, upper=20                            )\ncheckmate::assert_numeric(  ds$fte                , any.missing=F , lower=0, upper=40                            )\ncheckmate::assert_logical(  ds$fte_approximated   , any.missing=F                                                )\ncheckmate::assert_numeric(  ds$fte_rolling_median , any.missing=T , lower=0, upper=40                            )\n\ncounty_month_combo   <- paste(ds$county_id, ds$month)\ncheckmate::assert_character(county_month_combo, pattern  =\"^\\\\d{1,2} \\\\d{4}-\\\\d{2}-\\\\d{2}$\", any.missing=F, unique=T)"},{"path":"prototype-r.html","id":"chunk-specify-columns","chapter":"4 Prototypical R File","heading":"4.9 Specify Output Columns","text":"chunk:verifies variables exist uploading,documents (troubleshooting developers) variables product file, andreorders variables match expected structure.Variable order especially important database engines/drivers ignore variable name, use variable position.use term ‘slim’ typically output fewer variables full 
dataset processed file.doubt variable needed downstream, leave dplyr::select(), commented . someone needs future, ’ll easily determine might come , uncomment line (possibly modify database table). import column warehouse multiple people using, can tough remove without breaking code.chunk follows verify-values sometimes want check validity variables consumed downstream. variables important , illegal value may reveal larger problem dataset.","code":"\n# Print colnames that `dplyr::select()`  should contain below:\n#   cat(paste0(\"    \", colnames(ds), collapse=\",\\n\"))\n\n# Define the subset of columns that will be needed in the analyses.\n#   The fewer columns that are exported, the fewer things that can break downstream.\n\nds_slim <-\n  ds |>\n  # dplyr::slice(1:100) |>\n  dplyr::select(\n    subject_id,\n    county_id,\n    gender_id,\n    race,\n    ethnicity\n  )\n\nds_slim"},{"path":"prototype-r.html","id":"save-to-disk-or-database","chapter":"4 Prototypical R File","heading":"4.10 Save to Disk or Database","text":"","code":""},{"path":"prototype-r.html","id":"additional-resources","chapter":"4 Prototypical R File","heading":"4.11 Additional Resources","text":"(Colin Gillespie 2017), particularly “Efficient input/output” chapter.","code":""},{"path":"prototype-sql.html","id":"prototype-sql","chapter":"5 Prototypical SQL File","heading":"5 Prototypical SQL File","text":"New data scientists typically import entire tables database R, merge, filter, groom data.frames. efficient approach submit sql executes database returns specialized dataset.provides several advantages:database much efficient filtering joining tables programing language, R Python. well-designed database indexed columns optimizations surpass R Python capabilities.database handles datasets thousands times larger R Python can accommodate RAM. 
large datasets, database engines persist data hard drive (instead just RAM) optimized read necessary information RAM moment needed, return processed back disk progressing next block data.Frequently, portion table’s rows columns ultimately needed analysis. Reducing size dataset leaving database two benefits: less information travels across network R’s Python’s limited memory space conserved.scenarios, desirable use INSERT SQL command transfer data within database; never travel across network never touch R local machine. large complicated projects, majority data movement uses INSERT commands within SQL files. Among scenarios, analysis-focused projects use R call sequence SQL files (see flow.R), database-focused project uses SSIS.cases, try write SQL files conform similar standards conventions. stated Consistency across Files (previous chapter), using consistent file structure can (a) improve quality code structure proven time facilitate good practices (b) allow intentions clear teammates familiar order intentions chunks.","code":""},{"path":"prototype-sql.html","id":"sql-choice","chapter":"5 Prototypical SQL File","heading":"5.1 Choice of Database Engine","text":"major relational database engines use roughly syntax, slight deviations enhancements beyond SQL standards. databases hosted SQL Server, since OUHSC’s campus seems comfortable supporting. Consequently, chapter uses SQL Server 2017+ syntax.like data science teams, still need consume databases, Oracle MySQL. 
Outside OUHSC projects, tend use PostgreSQL Redshift.","code":""},{"path":"prototype-sql.html","id":"sql-ferry","chapter":"5 Prototypical SQL File","heading":"5.2 Ferry","text":"basic sql file moves data within database create table named dx, contained ley_covid_1 schema cdw_staging database.","code":"--use cdw_staging\ndeclare @start_date date = '2020-02-01';                               -- sync with config.yml\ndeclare @stop_date  date = dateadd(day, -1, cast(getdate() as date));  -- sync with config.yml\n\nDROP TABLE if exists ley_covid_1.dx;\nCREATE TABLE ley_covid_1.dx(\n  dx_id           int identity  primary key,\n  patient_id      int           not null,\n  covid_confirmed bit           not null,\n  problem_date    date,\n  icd10_code      varchar(20)   not null\n);\n-- TRUNCATE TABLE ley_covid_1.dx;\n\nINSERT INTO ley_covid_1.dx\nSELECT\n  pr.patient_id\n  ,ss.covid_confirmed\n  ,pr.invoice_date     as problem_date\n  ,pr.code             as icd10_code\n  -- into ley_covid_1.dx\nFROM cdw.star_1.fact_problem       as pr\n  inner join beasley_covid_1.ss_dx as ss on pr.code = ss.icd10_code\nWHERE\n  pr.problem_date_start between @start_date and @stop_date\n  and\n  pr.patient_id is not null\nORDER BY pr.patient_id, pr.problem_date_start desc\n\nCREATE INDEX ley_covid_1_dx_patient_id on ley_covid_1.dx (patient_id);\nCREATE INDEX ley_covid_1_dx_icd10_code on ley_covid_1.dx (icd10_code);"},{"path":"prototype-sql.html","id":"sql-default-database","chapter":"5 Prototypical SQL File","heading":"5.3 Default Databases","text":"prefer specify database table, instead control connection (DSN’s “default database” value). Nevertheless, ’s helpful include default database behind comment two reasons. First, communicates default database human reader. 
Second, debugging, code can highlighted ADS/SSMS executed “F5”; mimic happens file run via automation DSN.","code":"--use cdw_staging"},{"path":"prototype-sql.html","id":"sql-declare","chapter":"5 Prototypical SQL File","heading":"5.4 Declare Values Databases","text":"Similar Declare Globals chunk prototypical R file, values set top file easy read modify.","code":"declare @start_date date = '2020-02-01';                               -- sync with config.yml\ndeclare @stop_date  date = dateadd(day, -1, cast(getdate() as date));  -- sync with config.yml"},{"path":"prototype-sql.html","id":"sql-recreate","chapter":"5 Prototypical SQL File","heading":"5.5 Recreate Table","text":"batch-loading data, typically easiest drop recreate database table. snippet , table specific name dropped/deleted database replaced (possibly new) definition. like dedicate line table column, least three elements per line: name, data type, nulls allowed.Many features keywords available designing tables. ones occasionally use :primary key helps database optimization later querying table, enforces uniqueness, patient table two rows patient_id value. Primary keys must nonmissing, null keyword redundant.unique helpful table additional columns need unique (patient_ssn patient_id). advanced scenario using clustered columnar table, incompatible primary key designation.identity(1, 1) creates 1, 2, 3, … sequence, relieves client creating sequence something like row_number(). Note identity column exists, number columns SELECT clause one fewer columns defined CREATE TABLE.jump-start creation table definition, frequently use clause. operation creates new table, informed column properties source tables. Within ADS SSMS, refresh list tables select new table; option copy CREATE TABLE statement (similar snippet ) paste sql file. 
definition can modified, tightening null null.","code":"DROP TABLE if exists ley_covid_1.dx;\nCREATE TABLE ley_covid_1.dx(\n  dx_id           int identity(1, 1) primary key,\n  patient_id      int         not null,\n  covid_confirmed bit         not null,\n  problem_date    date            null,\n  icd10_code      varchar(20) not null\n);  -- into ley_covid_1.dx"},{"path":"prototype-sql.html","id":"sql-truncate","chapter":"5 Prototypical SQL File","heading":"5.6 Truncate Table","text":"scenarios table definition stable data refreshed frequently (say, daily), consider TRUNCATE-ing table. taking approach, prefer keep DROP CREATE code file, commented . saves development time future table definition needs modified.","code":"-- TRUNCATE TABLE ley_covid_1.dx;"},{"path":"prototype-sql.html","id":"sql-insert","chapter":"5 Prototypical SQL File","heading":"5.7 INSERT INTO","text":"INSERT (followed SELECT clause), simply moves data query specified table.INSERT clause transfers columns exact order query. try match names destination table. error thrown column types mismatched (e.g., attempting insert character string integer value).Even worse, error thrown mismatched columns compatible types. occur table’s columns patient_id, weight_kg, height_cm, query’s columns patient_id, height_cm, weight_in. weight height written incorrect columns, execution catch source weight_kg, destination weight_in.","code":"INSERT INTO ley_covid_1.dx"},{"path":"prototype-sql.html","id":"sql-select","chapter":"5 Prototypical SQL File","heading":"5.8 SELECT","text":"SELECT clause specifies desired columns. can also rename columns perform manipulations.prefer specify aliased table column. two source tables column name, error thrown regarding ambiguity. 
Even ’s concern, believe explicitly specifying source improves readability reduces errors.","code":"SELECT\n  pr.patient_id\n  ,ss.covid_confirmed\n  ,cast(pr.invoice_datetime as date) as problem_date\n  ,pr.code                           as icd10_code"},{"path":"prototype-sql.html","id":"sql-from","chapter":"5 Prototypical SQL File","heading":"5.9 FROM","text":"","code":"FROM cdw.star_1.fact_problem       as pr\n  inner join beasley_covid_1.ss_dx as ss on pr.code = ss.icd10_code"},{"path":"prototype-sql.html","id":"sql-where","chapter":"5 Prototypical SQL File","heading":"5.10 WHERE","text":"clause reduces number returned rows (opposed reducing number columns SELECT clause). Use indention level communicate reader subclauses combined. especially important operators used, since order operations can confused easily.","code":"WHERE\n  pr.problem_date_start between @start_date and @stop_date\n  and\n  pr.patient_id is not null"},{"path":"prototype-sql.html","id":"sql-order-by","chapter":"5 Prototypical SQL File","heading":"5.11 ORDER BY","text":"ORDER clause simply specifies order rows. 
default, column’s values ascending order, can descending desired.","code":"ORDER BY pr.patient_id, pr.problem_date_start desc"},{"path":"prototype-sql.html","id":"sql-indexing","chapter":"5 Prototypical SQL File","heading":"5.12 Indexing","text":"table large queried variety ways, indexing table can speed performance dramatically.","code":"CREATE INDEX ley_covid_1_dx_patient_id on ley_covid_1.dx (patient_id);\nCREATE INDEX ley_covid_1_dx_icd10_code on ley_covid_1.dx (icd10_code);"},{"path":"prototype-repo.html","id":"prototype-repo","chapter":"6 Prototypical Repository","heading":"6 Prototypical Repository","text":"following file repository structure supported wide spectrum projects, ranging (a) small, short-term retrospective project one dataset, one manipulation file, one analysis report (b) large, multi-year project fed dozens input files support multiple statisticians sophisticated enrollment process.Looking beyond single project, strongly encourage team adopt common file organization. Pursuing commonality provides multiple benefits:evolved thought-structure makes easier follow good practices avoid common traps.Code files portable projects. code can reused environments refer files directories like config.yml, data-public/raw, data-public/derivedPeople portable projects. person already familiar structure, start contributing quickly already know look statistical reports analysis/ debug problematic file ingestions manipulation/ files.specific project doesn’t use directory file, recommend retaining stub. 
Like empty chunks discusses Prototypical R File chapter, stub communicates collaborator, “project currently doesn’t use feature, /, location”. collaborator can stop search immediately, avoid searching weird places order rule-feature located elsewhere.template worked well us publicly available https://github.com/wibeasley/RAnalysisSkeleton. important files directories described . Please use starting point, dogmatic prison. Make adjustments fits specific project overall team.","code":""},{"path":"prototype-repo.html","id":"repo-root","chapter":"6 Prototypical Repository","heading":"6.1 Root","text":"following files live repository’s root directory, meaning subfolder/subdirectory.","code":""},{"path":"prototype-repo.html","id":"repo-config","chapter":"6 Prototypical Repository","heading":"6.1.1 config.R","text":"configuration file simply plain-text yaml file read config package. well-suited value coordinated across multiple files.Also see discussion use config file excluding bad data values config file relates yaml, json, xml.","code":"default:\n  # To be processed by Ellis lanes\n  path_subject_1_raw:  \"data-public/raw/subject-1.csv\"\n  path_mlm_1_raw:      \"data-public/raw/mlm-1.csv\"\n\n  # Central Database (produced by Ellis lanes).\n  path_database:       \"data-public/derived/db.sqlite3\"\n\n  # Analysis-ready datasets (produced by scribes & consumed by analyses).\n  path_mlm_1_derived:  \"data-public/derived/mlm-1.rds\"\n\n  # Metadata\n  path_annotation:     \"data-public/metadata/cqi-annotation.csv\"\n\n  # Logging errors and messages from automated execution.\n  path_log_flow:       !expr strftime(Sys.time(), \"data-unshared/log/flow-%Y-%m-%d--%H-%M-%S.log\")\n\n  # time_zone_local       :  \"America/Chicago\" # Force local time, in case remotely run.\n\n  # ---- Validation Ranges & Patterns ----\n  range_record_id         : !expr c(1L, 999999L)\n  range_dob               : !expr c(as.Date(\"2010-01-01\"), Sys.Date() + lubridate::days(1))\n  
range_datetime_entry    : !expr c(as.POSIXct(\"2019-01-01\", tz=\"America/Chicago\"), Sys.time())\n  max_age                 : 25\n  pattern_mrn             : \"^E\\\\d{9}$\"  # An 'E', followed by 9 digits."},{"path":"prototype-repo.html","id":"repo-flow","chapter":"6 Prototypical Repository","heading":"6.1.2 flow.R","text":"workflow repo determined flow.R. calls (typically R, Python, SQL) files specific order, sending log messages file.See automation mediators details.","code":""},{"path":"prototype-repo.html","id":"repo-readme","chapter":"6 Prototypical Repository","heading":"6.1.3 README.md","text":"readme automatically displayed GitHub repository opened browser. Include static information can quickly orientate collaborator. Common elements include:Project Name (see style guide naming recommendations)Principal Investigator (ultimately accountable research) Project Coordinator (easy contact questions arise)IRB Tracking Number (whatever oversight committee reviewed approved project). help communicate accurately within larger university company.Abstract project description already written (example, part IRB submission).Documentation locations resources, described documentation/ section belowData Locations resources, \ndatabase database server\nREDCap project id url\nnetworked file share\ndatabase database serverREDCap project id urlnetworked file shareThe PI’s expectations goals analysis teamLikely deadlines, grant conference submission datesEach directory can readme file, (typical analysis projects) discourage putting much individual readme. ’ve found becomes cumbersome keep scattered files updated consistent; ’s also work reader traverse directory structure reading everything. 
approach concentrate information repo’s root readme, remaining readmes static unchanged across projects (e.g., generic description data-public/metadata/).","code":""},{"path":"prototype-repo.html","id":"repo-rproj","chapter":"6 Prototypical Repository","heading":"6.1.4 *.Rproj","text":"Rproj file stores project-wide settings used RStudio IDE, trailing whitespaces handled. file’s major benefit sets R session’s working directory, facilitates good discipline setting constant location files repo. Although plain-text file can edited directly, recommend using RStudio’s dialog box. good documentation Rproj settings. unsure, copy file repo’s root directory rename match repo exactly.","code":""},{"path":"prototype-repo.html","id":"repo-manipulation","chapter":"6 Prototypical Repository","heading":"6.2 manipulation/","text":"","code":""},{"path":"prototype-repo.html","id":"repo-analysis","chapter":"6 Prototypical Repository","heading":"6.3 analysis/","text":"sense, directories exist support contents analysis/. exploratory, descriptive, inferential statistics produced Rmd files. subdirectory name report, (e.g., analysis/report-te-1) within directory four files:R file contains meat analysis (e.g., analysis/report-te-1/report-te-1.R).Rmd file serves “presentation layer” calls R file (e.g., analysis/report-te-1/report-te-1.Rmd).markdown file produced directly Rmd (e.g., analysis/report-te-1/report-te-1.md). people consider intermediate file exists mostly knitr/rmarkdown/pandoc produce eventual html file.html file derived markdown file (e.g., analysis/report-te-1/report-te-1.html). markdown html files can safely discarded reproduced next time Rmd rendered. tables graphs html file self-contained, meaning single file portable emailed without concern directory read . 
Collaborators rarely care manipulation files analysis code; almost always look exclusively outputted html.","code":""},{"path":"prototype-repo.html","id":"repo-data-public","chapter":"6 Prototypical Repository","heading":"6.4 data-public/","text":"directory contain information sensitive proprietary. hold PHI (Protected Health Information), information like participant names, social security numbers, passwords. Files PHI stored GitHub repository, even private GitHub repository.Please see data-unshared/ options storing sensitive information.data-public/ directory typically works best organized subdirectories. commonly use subdirectories, corresponds Data Rest chapter.","code":""},{"path":"prototype-repo.html","id":"data-publicraw","chapter":"6 Prototypical Repository","heading":"6.4.1 data-public/raw/","text":"…input pipelines. datasets usually represents hard work data collection.","code":""},{"path":"prototype-repo.html","id":"data-publicmetadata","chapter":"6 Prototypical Repository","heading":"6.4.2 data-public/metadata/","text":"…definitions datasets raw. example, “gender.csv” might translate values 1 2 male female. Sometimes dataset feels natural either raw metadata subdirectory. file remain unchanged subsequent sample collected, lean towards metadata.","code":""},{"path":"prototype-repo.html","id":"data-publicderived","chapter":"6 Prototypical Repository","heading":"6.4.3 data-public/derived/","text":"…output pipelines. contents completely reproducible starting data-public/raw/ repo’s code. words, can deleted recreated ease. 
might contain small database file, like SQLite.","code":""},{"path":"prototype-repo.html","id":"data-publiclogs","chapter":"6 Prototypical Repository","heading":"6.4.4 data-public/logs/","text":"…logs useful collaborators necessary demonstrate something future, beyond reports contained analysis/ directory.","code":""},{"path":"prototype-repo.html","id":"data-publicoriginal","chapter":"6 Prototypical Repository","heading":"6.4.5 data-public/original/","text":"…nothing (hopefully); ideally never used. similar data-public/raw/. difference data-public/raw/ called pipeline code, data-public/original/ .file data-public/original/ typically comes investigator malformed state requires manual intervention; copied data-public/raw/. Common offenders (a) csv Excel file bad missing column headers, (b) strange file format readable R package, (c) corrupted file require rehabilitation utility.","code":""},{"path":"prototype-repo.html","id":"characteristics","chapter":"6 Prototypical Repository","heading":"6.4.6 Characteristics","text":"characteristics data-public/ vary based subject matter. instance, medical research projects typically use metadata directory repo, incoming information contains PHI therefore database preferred location. hand, microbiology physics research typically data protected law, desirable repo contain everything ’s unnecessarily spread .feel private GitHub repo offers adequate protection scooped biggest risk.","code":""},{"path":"prototype-repo.html","id":"repo-data-unshared","chapter":"6 Prototypical Repository","heading":"6.5 data-unshared/","text":"Files directory stored local computer, committed sent central GitHub repository/server. makes folder candidate :sensitive information, PHI (Protected Health Information). 
PHI involved, recommend data-unshared/ database secured networked file share feasible. See discussion .huge public files say, files 1+ GB easily downloadable reproducible. instance, files stable sources like US Census, Bureau Labor Statistics, dataverse.org.diagnostic logs useful collaborators.line repo’s .gitignore file blocks directory’s contents staged/committed (look /data-unshared/*). Since files directory committed, requires discipline communicate files collaborator’s computer. List files either repo’s readme data-unshared/contents.md; minimum declare name file can downloaded reproduced. (curious, !data-unshared/contents.md line .gitignore declares exception markdown file committed updated collaborator’s machine.)Even though files kept central repository, recommend encrypting local drive data-unshared/ contains sensitive data (PHI). See data-public/ README.md information.directory works best subdirectories described organization data-public/.Compared data-unshared/, prefer storing PHI enterprise database (SQL Server, PostgreSQL, MariaDB/MySQL, Oracle) networked drive four reasons.central resources typically managed Campus reviewed security professionals.’s trivial stay synchronized across collaborators file share database. contrast, data-unshared/ isn’t synchronized across machines extra discipline required tell collaborators update machines.’s sometimes possible recover lost data file share database. ’s much less likely turn back clock data-unshared/ files.’s unlikely mess .gitignore entries allow sensitive files committed repository. 
sensitive information stored data-unshared/, important review every commit ensure information isn’t sneak repo.","code":""},{"path":"prototype-repo.html","id":"repo-documentation","chapter":"6 Prototypical Repository","heading":"6.6 documentation/","text":"Good documentation scarce documentation files consume little space, liberally copy everything get directory. helpful include:Approval letters IRB oversight board. especially important also gatekeeper database, must justify releasing sensitive information.Data dictionaries incoming datasets team ingesting.Data dictionaries derived datasets team producing.documentation public stable, like CDC’s site vaccination codes, include url repo’s readme. feel information location may change, copy url also full document easier reconstruct logic returning project years.","code":""},{"path":"prototype-repo.html","id":"repo-optional","chapter":"6 Prototypical Repository","heading":"6.7 Optional","text":"Everything mentioned now exist repo, even file directory empty. projects benefit following additional capabilities.","code":""},{"path":"prototype-repo.html","id":"repo-description","chapter":"6 Prototypical Repository","heading":"6.7.1 DESCRIPTION","text":"plain-text DESCRIPTION file lives repo’s root directory –see example R Analysis Skeleton. file allows repo become R package, provides following benefits even never deployed CRAN.specify packages (versions) required code. include packages aren’t available CRAN, like OuhscBbmc/OuhscMunge.better unify test common code called multiple files.better document functions datasets within repo.last two bullets essentially upgrade merely sticking code file sourcing .package offers many capabilities beyond listed , typical data science repo scratch surface. 
larger topic covered Hadley Wickham’s R Packages.","code":""},{"path":"prototype-repo.html","id":"repo-utility","chapter":"6 Prototypical Repository","heading":"6.7.2 utility/","text":"Include files may run occasionally, required reproduce analyses. Examples include:code submitting entire repo pipeline super computer,simulate artificial demonstration data, orrunning diagnostic checks code using something like goodpractice urlchecker.","code":""},{"path":"prototype-repo.html","id":"repo-stitched","chapter":"6 Prototypical Repository","heading":"6.7.3 stitched-output/","text":"Stitching light-weight capability knitr/rmarkdown. stitch repo’s files (server type logging), consider directing output directory. basic call :don’t use approach medical research, sensitive information usually contained output, sensitive patient information stored repo. (’s last time ’ll talk sensitive information –least chapter.)","code":"\nknitr::stitch_rmd(\n  script = \"manipulation/car-ellis.R\",\n  output = \"stitched-output/manipulation/car-ellis.md\"\n)"},{"path":"rest.html","id":"rest","chapter":"7 Data at Rest","heading":"7 Data at Rest","text":"","code":""},{"path":"rest.html","id":"rest-states","chapter":"7 Data at Rest","heading":"7.1 Data States","text":"extension data-public/ discussion. chapter theoretical applies forms data, just files prototypical repo.easiest demarcate data two states: raw derived. Raw data represents input pipelines. Sometimes junk. usually files cherished culmination hard work data collection. Derived data represents output pipelines. contents completely reproducible starting raw data repo’s code. words, derived information can deleted recreated ease.terminology, original data file directly received collaborator. good day, “original” “raw” synonymous. Meaning files received ingestible directly pipeline. However sometimes collaborator provides malformed data file requires manual intervention. rehabilitated, becomes raw data. 
Common offenders (a) csv Excel file bad missing column headers, (b) strange file format readable R package, (c) corrupted file require rehabilitation utility.original file isn’t perfect, ’ll decide blemishes can programmatically fixed, blemishes manually fixed. triage process, sometimes difficult determine worth investing time fix code. everything can fixed code, original raw data equivalent (“original” state can ignored).heuristics help decide address manually programmatically.Arguments Programmatic Fixes:original data frequently refreshed. pipeline ingests new files every day, ’s probably worth investment fix.*code *code wouldArguments Manual Fixes:corrections subjective. Sometimes desired fix follow deterministic rules. scenarios, see “Return file collaborator” alternative.’s quick fix one-time dataset.Alternatives:Return file collaborator. Especially grad students interns available. One justification ’re usually experts field, . better equipped evaluate data point context determine correct correction. second justification company/university probably doesn’t want pay statisticians data scientists clean upIf corporate consultant, propose team willing fix data points provide estimated cost training personnel correctly evaluate context client can offload task.Separate excise manual step. majority file can ingested without manual intervention, try split task two. Consider patient’s visit record hospital database. 
information well-structured easily transformed discrete cells. However “visit notes” written nurses physician . Sometimes notes areRawDerived\nProject-wide File Repo\nProject-wide File Protected File Server\nUser-specific File Protected File Server\nProject-wide Database\nOriginal","code":""},{"path":"rest.html","id":"data-containers","chapter":"7 Data at Rest","heading":"7.2 Data Containers","text":"","code":""},{"path":"rest.html","id":"rest-containers-csv","chapter":"7 Data at Rest","heading":"7.2.1 csv","text":"exchanging data two different systems, preferred format frequently plain text, cell record separated comma. commonly called csv –comma separated value file. opposed proprietary formats like xlsx sas7bdat, csv file easily opened parsable statistical software, even conventional text editors GitHub.","code":""},{"path":"rest.html","id":"rest-containers-rds","chapter":"7 Data at Rest","heading":"7.2.2 rds","text":"","code":""},{"path":"rest.html","id":"rest-containers-yaml","chapter":"7 Data at Rest","heading":"7.2.3 yaml, json, and xml","text":"yaml, json, xml three plain-text hierarchical formats commonly used data structure naturally represented rectangle set rectangles (therefore good fit csv rds). unsure start nested dataset, see tidyr’s Rectangling vignette.way advocate simplest recoding function adequate task, prefer yaml json, json xml. Yaml accommodates , needs. Initially may tricky correctly use whitespacing specify correct nesting structure yaml, familiar, file easy read edit, Git diffs can quickly reviewed. yaml package reads yaml file, returns (nested) R list; can also convert R list yaml file.config package wraps yaml package fill common need: retrieving repository configuration information yaml file. recommend using config package fits. ways functionality simplification yaml package, extension ways. 
example, value follows !expr, R evaluate expression. commonly specify allowable ranges variables config.ymlSee discussion config.yml prototypical repository, well.","code":"range_dob: !expr c(as.Date(\"2010-01-01\"), Sys.Date() + lubridate::days(1))"},{"path":"rest.html","id":"rest-containers-arrow","chapter":"7 Data at Rest","heading":"7.2.4 Arrow","text":"Apache Arrow open source specification developed work many languages R, Spark, Python, many others. accommodates nice rectangles CSVs used, hierarchical nesting json xml used.-memory specification (allows Python process directly access R object), -disk specification (allows Python process read saved R file). file format compressed, takes much less space store disk less time transfer network.downside file plain-text, binary. means file readable editable many programs, hurts project’s portability. wouldn’t want store metadata files arrow collaborators couldn’t easily help map values qqq","code":""},{"path":"rest.html","id":"rest-containers-sqlite","chapter":"7 Data at Rest","heading":"7.2.5 SQLite","text":"","code":""},{"path":"rest.html","id":"rest-containers-database","chapter":"7 Data at Rest","heading":"7.2.6 Central Enterprise database","text":"","code":""},{"path":"rest.html","id":"rest-containers-redcap","chapter":"7 Data at Rest","heading":"7.2.7 Central REDCap database","text":"","code":""},{"path":"rest.html","id":"rest-containers-avoid","chapter":"7 Data at Rest","heading":"7.2.8 Containers to avoid","text":"","code":""},{"path":"rest.html","id":"rest-containers-avoid-spreadsheets","chapter":"7 Data at Rest","heading":"7.2.8.1 Spreadsheets","text":"Try receive data Excel files. think Excel can useful light brainstorming prototyping equations –trusted transport serious information. spreadsheet software like LibreOffice Calc less problematic experience, still less desirable formats mentioned .receive csv open typical spreadsheet program, strongly recommend save , potential mangling values. 
close spreadsheet, review Git commits verify values corrupted.See appendix list ways analyses can undermined receiving Excel files, well template correspond less-experienced colleagues sending team Excel files.","code":""},{"path":"rest.html","id":"rest-containers-avoid-proprietary","chapter":"7 Data at Rest","heading":"7.2.8.2 Proprietary","text":"Proprietary formats like SAS’s “sas7bdat” less accessible people without current expensive software licenses. Therefore distributing proprietary file formats hurts reproducibility decreases project’s impact. hand, using proprietary formats may advantageous need conceal project’s failure.formerly distributed sas7bdat files supplement (otherwise identical) csvs, order cater surprisingly large population SAS users unfamiliar proc import Google search engine. Recently distributed csvs, example code reading file SAS.","code":""},{"path":"rest.html","id":"data-conventions","chapter":"7 Data at Rest","heading":"7.3 Storage Conventions","text":"","code":""},{"path":"rest.html","id":"rest-conventions-all","chapter":"7 Data at Rest","heading":"7.3.1 All Sources","text":"Across file formats, conventions usually work best.consistency across versions: use script produce dataset, inform recipient dataset’s structure changes. processes automated, changes trivial humans (e.g., yyyy-mm-dd mm/dd-yy) break automation.specificity automation intentional. install guards processes bad values pass. 
instance, may place bounds toddlers’ age 12 36 months. want automation break next dataset contains age values 1 3 (years). downstream analysis (say, regression model age predictor variable) produce misleading results shift months years went undetected.date format: specify YYYY-MM-DD (ISO-8601)time format: specify HH:MM HH:MM:SS, preferably 24-hour time. Use leading zero midnight 9:59am, colon separating hours, minutes, seconds (.e., 09:59)patient names: separate name_last, name_first, name_middle three distinct variables possible.currency: represent money integer floating-point variable. representation easily parsable software, enables mathematical operations (like max() mean()) performed directly. Avoid commas symbols like “$”. possibility ambiguity, indicate denomination variable name (e.g., payment_dollars payment_euros).","code":""},{"path":"rest.html","id":"rest-conventions-text","chapter":"7 Data at Rest","heading":"7.3.2 Text","text":"conventions usually work best within plain-text formats.csv: comma separated values common plain-text format, better support similar formats cells separated tabs semi-colons. 
However, receiving well-behaved file separated characters, thankful go flow.cells enclosed quotes: ‘cell’ enclosed double quotes, especially ’s string/character variable.","code":""},{"path":"rest.html","id":"rest-conventions-excel","chapter":"7 Data at Rest","heading":"7.3.3 Excel","text":"discussed avoid Excel. possible, conventions helps reduce ambiguity corrupted values. See appendix preferred approach reading Excel files.avoid multiple tabs/worksheets: Excel files containing multiple worksheets complicated read automation, produces opportunities inconsistent variables across tabs/worksheets.save cells text: avoiding Excel attempting save cells dates numbers. Admittedly, last-ditch effort. someone using Excel convert cells text, values probably already corrupted.","code":""},{"path":"rest.html","id":"rest-conventions-meditech","chapter":"7 Data at Rest","heading":"7.3.4 Meditech","text":"patient identifier: mrn_meditech instead mrn, MRN Rec#, Med Rec#.account/admission identifier: account_number instead mrn, Acct#, Account#.patient’s full name: name_full instead Patient Name Name.long/tall format: one row per dx per patient (50 dxs) instead 50 columns dx per patient. 
Applies to\ndiagnosis code & description\norder date & number\nprocedure name & number\nMeditech Idiosyncrasies:blood pressure: systems bp_diastolic bp_systolic values stored separate integer variables. Meditech, stored single character variable, separated forward slash.","code":""},{"path":"rest.html","id":"rest-conventions-database","chapter":"7 Data at Rest","heading":"7.3.5 Databases","text":"exchanging data two different systems, …","code":""},{"path":"patterns.html","id":"patterns","chapter":"8 Patterns","heading":"8 Patterns","text":"","code":""},{"path":"patterns.html","id":"pattern-ellis","chapter":"8 Patterns","heading":"8.1 Ellis","text":"","code":""},{"path":"patterns.html","id":"purpose","chapter":"8 Patterns","heading":"8.1.1 Purpose","text":"incorporate outside data source system safely.","code":""},{"path":"patterns.html","id":"philosophy","chapter":"8 Patterns","heading":"8.1.2 Philosophy","text":"Without data immigration, warehouses useless. Embrace power fresh information way :\nrepeatable data source updated (refresh warehouse)\nsimilar Ellis lanes (designed data sources) don’t learn/remember entirely new pattern. (Like Rubiks cube instructions.)","code":""},{"path":"patterns.html","id":"guidelines","chapter":"8 Patterns","heading":"8.1.3 Guidelines","text":"Take small bites.\nLike software development, don’t tackle complexity first time. Start processing important columns incorporating move.\nUse variables need short-term, especially new projects. everyone knows, variables upstream source can change. 
Don’t spend effort writing code variables won’t need months/years; ’ll likely change need .\nrow passes verify-values chunk, ’re accountable failures causes warehouse. analysts know external data messy, don’t surprised. Sometimes ’ll spend hour writing Ellis 6 columns.\nNarrowly define Ellis lane. One code file strive (a) consume one CSV (b) produce one table. Exceptions include:multiple input files related, really belong together (e.g., one CSV per month, one CSV per clinic). scenario pretty common.CSV legitimately produce two different tables munging. 
happens infrequently, one warehouse table needs wide, another long.","code":""},{"path":"patterns.html","id":"examples","chapter":"8 Patterns","heading":"8.1.4 Examples","text":"https://github.com/wibeasley/RAnalysisSkeleton/blob/main/manipulation/te-ellis.R https://github.com/wibeasley/RAnalysisSkeleton/blob/main/manipulation/ https://github.com/OuhscBbmc/usnavy-billets/blob/main/manipulation/survey-ellis.R","code":""},{"path":"patterns.html","id":"elements","chapter":"8 Patterns","heading":"8.1.5 Elements","text":"Clear memory scripting languages like R (unlike compiled languages like Java), ’s easy old variables hang around. Explicitly clear run file .Load Sources R, source()d file run execute code. prefer sourced file load variables (like function definitions), instead real operations like read dataset perform calculation. many times want function available multiple files repo; two approaches like. first collecting common functions single file (sourcing callers). second make repo legitimate R package.first approach better suited quick & easy development. 
second allows add documentation unit tests.Load Packages another precaution necessary scripting language. Determine necessary packages available machine. Avoiding attaching packages (library() function) possible. functions don’t need qualified (e.g., dplyr::intersect()) cause naming conflicts. Even can guarantee don’t conflict packages now, packages add new functions future conflict.Declare Global Variables Functions. 
includes defining expected column names types data sources; use readr::cols_only() (opposed readr::cols()) ignore new columns may added since dataset’s last refresh.Load Data Source(s) See load-data chunk described prototypical file.Tweak DataSee tweak-data chunk described prototypical file.Body EllisVerifySpecify ColumnsSee specify-columns-to-upload chunk described prototypical file.Welcome warehouse. chunk, nothing persisted.","code":"\nrm(list = ls(all = TRUE)) # Clear the memory of variables from previous run. 
This is not called by knitr, because it's above the first chunk.\n# ---- load-sources ------------------------------------------------------------\nsource(\"./manipulation/osdh/ellis/common-ellis.R\")\n# ---- load-packages -----------------------------------------------------------\n# Attach these package(s) so their functions don't need to be qualified: http://r-pkgs.had.co.nz/namespace.html#search-path\nlibrary(magrittr            , quietly=TRUE)\nlibrary(DBI                 , quietly=TRUE)\n\n# Verify these packages are available on the machine, but their functions need to be qualified: http://r-pkgs.had.co.nz/namespace.html#search-path\nrequireNamespace(\"readr\"        )\nrequireNamespace(\"tidyr\"        )\nrequireNamespace(\"dplyr\"        ) # Avoid attaching dplyr, b/c its function names conflict with a lot of packages (esp base, stats, and plyr).\nrequireNamespace(\"testit\")\nrequireNamespace(\"checkmate\")\nrequireNamespace(\"OuhscMunge\") # remotes::install_github(repo=\"OuhscBbmc/OuhscMunge\")\n# ---- declare-globals ---------------------------------------------------------\n# ---- load-data ---------------------------------------------------------------\n# ---- tweak-data --------------------------------------------------------------\n# ---- specify-columns-to-upload -----------------------------------------------\n# ---- save-to-db --------------------------------------------------------------\n# ---- save-to-disk ------------------------------------------------------------"},{"path":"patterns.html","id":"pattern-arch","chapter":"8 Patterns","heading":"8.2 Arch","text":"","code":""},{"path":"patterns.html","id":"pattern-ferry","chapter":"8 Patterns","heading":"8.3 Ferry","text":"","code":""},{"path":"patterns.html","id":"pattern-scribe","chapter":"8 Patterns","heading":"8.4 Scribe","text":"","code":""},{"path":"patterns.html","id":"pattern-analysis","chapter":"8 Patterns","heading":"8.5 
Analysis","text":"","code":""},{"path":"patterns.html","id":"pattern-presentation-static","chapter":"8 Patterns","heading":"8.6 Presentation -Static","text":"","code":""},{"path":"patterns.html","id":"pattern-presentation-interactive","chapter":"8 Patterns","heading":"8.7 Presentation -Interactive","text":"","code":""},{"path":"patterns.html","id":"pattern-metadata","chapter":"8 Patterns","heading":"8.8 Metadata","text":"Survey items can change across time (justified unjustified reasons). prefer dedicate metadata csv single variable https://github.com/LiveOak/vasquez-mexican-census-1/issues/17#issuecomment-567254695","code":""},{"path":"patterns.html","id":"primary-rules-for-mapping","chapter":"8 Patterns","heading":"8.8.1 Primary Rules for Mapping","text":"important rules necessary map concepts multidimensional space.variable gets csv, relationship.csv (show ), education.csv, living-status.csv, race.csv. ’s easiest file name matches variable.variable also needs unique integer identifies underlying level database, education_id, living_status_id, relationship_id.survey wave gets column within csv, code_2011 code_2016.level within variable-wave gets row, like Jefe, Esposo, Hijo.","code":""},{"path":"patterns.html","id":"secondary-rules-for-mapping","chapter":"8 Patterns","heading":"8.8.2 Secondary Rules for Mapping","text":"scenarios, first three columns critical (.e., relationship_id, code_2011, code_2016). Yet additional guidelines help plumbing manipulation lookup variables.variable also needs unique name identifies underlying level human, education, living_status, relationship. 
human label corresponding relationship_id. ’s easiest column name matches variable.survey wave gets column within csv, description_2011 description_2016. human labels corresponding variables like code_2011 code_2016.variable benefits unique display order value, used later analyses. Categorical variables typically desired sequence graph legends tables; specify order . helps define factor levels R pandas.Categorical levels Python.Mappings usually informed outside documentation. transparency maintainability, clearly describe documentation can found. One option include data-public/metadata/README.md. Another option include bottom csv, preceded #, ‘comment’ character can keep csv-parser treating notes like data needs squeeze cells. Notes example :sometimes notes column helps humans keep things straight, especially researchers new field/project. 
example , notes value first row might “jefe means ‘head’, ‘boss’”.","code":"# Notes,,,,,,\n# 2016 codes come from `documentation/2106/fd_endireh2016_dbf.pdf`, pages 14-15,,,,,\n# 2011 codes come from `documentation/2011/fd_endireh11.xls`, ‘TSDem’ tab,,,,,"},{"path":"security.html","id":"security","chapter":"9 Security & Private Data","heading":"9 Security & Private Data","text":"Overview{Include paragraphs describe principles mentality, following sections contribute.}report’s dataset(s) preferably stored REDCap SQL Server.\n’re absolutely not stored GitHub local machine.\nAvoid Microsoft Access, Excel, CSVs, anything without user accounts.\nPHI must stored loose file (eg, CSV), keep encrypted file server.\nPHI fileserver stored directory controlled fairly restrictive Windows AD group. ~4 people project probably need access files, ~20 people project.\nmany benefits SQL Server CSVs Excel files .\n’s protected Odyssey (not just VPN).\nprovides auditing logs.\nprovides schemas partition authorization.\nReal databases aren’t accidentally emailed copied unsecured location.\nTransfer PHI REDCap & SQL Server early possible (particularly CSVs & XLSXs regularly receive partners).\nTemporary derivative datasets stored SQL Server, not CSV fileserver.","code":""},{"path":"security.html","id":"security-guidelines","chapter":"9 Security & Private Data","heading":"9.1 Security Guidelines","text":"encounter decision ’s described chapter’s security practices, follow underlying concepts. 
course, consult people.Principle least privilege: expose little possible.\nLimit number team members.\nLimit amount data (consider rows & columns).\nObfuscate values remove unnecessary PHI derivative datasets.\nRedundant layers protection.\nsingle point failure shouldn’t enough breach PHI security.\nSimplicity possible.\nStore data two houses (eg, REDCap & SQL Server).\nEasier identify & manage bunch PHI CSVs scattered across dozen folders, versions.\nManipulate data programmatically, not manually.\n\nWindows AD account controls everything, indirectly directly:\nVPN, Odyssey, file server, SQL, REDCap, & REDCap API.\n\nLock team members possible.\n’s don’t trust lot unnecessary data, ’s don’t trust ex-boyfriends coffee shop hackers.","code":""},{"path":"security.html","id":"dataset-level-redaction","chapter":"9 Security & Private Data","heading":"9.2 Dataset-level Redaction","text":"Several multi-layered strategies exist prevent exposing PHI. One approach simply reduce information contained variable. Much information medical record useful modeling descriptive statistics, therefore can omitted downstream datasets. 
techniques include:Remove variable: empty bucket nothing leak.Decrease resolution: Many times, patient’s year birth adequate analysis, include month day unnecessary risks.Hash salt identifiers: use cryptographic-quality algorithms transform ID derived value. example, “234” becomes “1432c1a399”. original value 234 not recoverable 1432c1a399. two rows 1432c1a399 still attributed patient statistical model.","code":""},{"path":"security.html","id":"security-for-data-at-rest","chapter":"9 Security & Private Data","heading":"9.3 Security for Data at Rest","text":"report’s dataset(s) preferably stored REDCap SQL Server.\n’re absolutely not stored GitHub local machine.\nAvoid Microsoft Access, Excel, CSVs, anything without user accounts.\nPHI must stored loose file (eg, CSV), keep encrypted file server.\nPHI fileserver stored directory controlled fairly restrictive Windows AD group. ~4 people project probably need access files, ~20 people project.many benefits SQL Server CSVs Excel files .\n’s protected Odyssey (not just VPN).\nprovides auditing logs.\nprovides schemas partition authorization.\nReal databases aren’t accidentally emailed copied unsecured location.\nTransfer PHI REDCap & SQL Server early possible (particularly CSVs & XLSXs regularly receive partners).Temporary derivative datasets stored SQL Server, not CSV fileserver.Hash values possible. instance, determine families/networks people, use things like SSNs. algorithm identifies clusters doesn’t need know actual SSN, just two records SSN. Something like SHA-256 hash good . algorithm can operate hashed SSN just effectively real SSN. However original SSN can’t determined hashed value. 
table accidentally exposed public, PHI compromised. following two files help hashing & salting process: HashUtility.R CreateSalt.R.","code":""},{"path":"security.html","id":"file-level-permissions","chapter":"9 Security & Private Data","heading":"9.4 File-level permissions","text":"","code":""},{"path":"security.html","id":"database-permissions","chapter":"9 Security & Private Data","heading":"9.5 Database permissions","text":"","code":""},{"path":"security.html","id":"public-private-repositories","chapter":"9 Security & Private Data","heading":"9.6 Public & Private Repositories","text":"","code":""},{"path":"security.html","id":"repo-rules","chapter":"9 Security & Private Data","heading":"9.6.1 Repo Rules","text":"code repository private, restricted necessary project members.\nrepo controlled OUHSC organization, individual’s private account.\n.gitignore file prohibits common data file formats pushed/uploaded central repository.\nExamples: accdb, mdb, xlsx, csv, sas7bdat, rdata, RHistory.\ntext file without PHI must GitHub, create new extension like ‘*.PhiFree’.\ncan include specific exception .gitignore file, adding exclamation point front file, !RecruitmentProductivity/RecruitingZones/ZipcodesToZone.csv. example included current repository’s [.gitignore file](https://github.com/OuhscBbmc/RedcapExamplesAndPatterns/blob/main/.gitignore).","code":""},{"path":"security.html","id":"scrubbing-github-history","chapter":"9 Security & Private Data","heading":"9.6.2 Scrubbing GitHub history","text":"Occasionally files may committed git repository need removed completely. 
just current collections files (i.e., branch’s head), entire history repo.\nScrubbing require typically (a) sensitive file accidentally committed pushed GitHub, (b) huge file bloated repository disrupted productivity.\ntwo suitable scrubbing approaches require command line. first git-filter-branch command within git, second BFG repo-cleaner. use second approach, [recommended GitHub]; requires 15 minutes install configure scratch, much easier develop , executes much faster.\nbash-centric steps remove files repo history called ‘monster-data.csv’ ‘bloated’ repository.\nfile contains passwords, change immediately.\nDelete ‘monster-data.csv’ branch push commit GitHub.\nAsk collaborators push outstanding commits GitHub delete local copy repo. scrubbing complete, re-clone .\nDownload install recent Java JRE Oracle site.\nDownload recent jar file BFG site home directory.\nClone fresh copy repository user’s home directory. --mirror argument avoids downloading every file, downloads bookkeeping details required scrubbing.\ncd ~\ngit clone --mirror https://github.com/your-org/bloated.git\nClone fresh copy repository user’s home directory. 
--mirror argument avoids downloading every file, downloads bookkeeping details required scrubbing.\nRemove files (directory) called ‘monster-data.csv’.\njava -jar bfg-*.jar --delete-files monster-data.csv bloated.git\nReflog garbage collect repo.\ncd bloated.git\ngit reflog expire --expire=now --all && git gc --prune=now --aggressive\nPush local changes GitHub server.\ngit push\nDelete bfg jar home directory.\ncd ~\nrm bfg-*.jar\nAsk collaborators re-clone repo local machine. important restart fresh copy, -scrubbed file reintroduced repo’s history.\nfile contains sensitive information, like passwords PHI, ask GitHub support refresh cache file’s history isn’t accessible website, even repo private.\nGitHub provides chatbot helps submit request. 
time writing, go https://support.github.com/request?tags=docs-generic&q=remove+cached+views click “Clear cached views Virtual Agent” blue button.","code":"cd ~\ngit clone --mirror https://github.com/your-org/bloated.git\n\njava -jar bfg-*.jar --delete-files monster-data.csv bloated.git\n\ncd bloated.git\ngit reflog expire --expire=now --all && git gc --prune=now --aggressive\n\ngit push\n\ncd ~\nrm bfg-*.jar"},{"path":"security.html","id":"resources","chapter":"9 Security & Private Data","heading":"9.6.2.0.1 Resources","text":"BFG Repo-Cleaner site\nAdditional BFG instructions\nGitHub Sensitive Data Removal Policy\nGitHub Removing sensitive data repository","code":""},{"path":"automation.html","id":"automation","chapter":"10 Automation & Reproducibility","heading":"10 Automation & Reproducibility","text":"Automation important prerequisite reproducibility.","code":""},{"path":"automation.html","id":"automation-mediator","chapter":"10 Automation & Reproducibility","heading":"10.1 Mediator","text":"nontrivial project usually multiple stages pipeline. Instead human deciding execute piece, single file execute pieces. 
single file makes project portable, also clearly documents process.single file special cases mediator pattern, sense defines piece relates .","code":""},{"path":"automation.html","id":"automation-flow","chapter":"10 Automation & Reproducibility","heading":"10.1.1 Flow File in R","text":"{Describe https://github.com/wibeasley/RAnalysisSkeleton/blob/main/flow.R.}See also prototypical repo.","code":""},{"path":"automation.html","id":"automation-makefile","chapter":"10 Automation & Reproducibility","heading":"10.1.2 Makefile","text":"{Briefly describe language, can efficient, additional obstacles presents.}","code":""},{"path":"automation.html","id":"automation-ssis","chapter":"10 Automation & Reproducibility","heading":"10.1.3 SSIS","text":"{Describe SSIS package development.}","code":""},{"path":"automation.html","id":"automation-scheduling","chapter":"10 Automation & Reproducibility","heading":"10.2 Scheduling","text":"","code":""},{"path":"automation.html","id":"automation-cron","chapter":"10 Automation & Reproducibility","heading":"10.2.1 cron","text":"cron common choice scheduling tasks Linux. plain text file specifies file run, recurring schedule. lot helpful documentation tutorials exists, well sites help construct validate entries like crontab guru.","code":""},{"path":"automation.html","id":"automation-task-scheduler","chapter":"10 Automation & Reproducibility","heading":"10.2.2 Task Scheduler","text":"Windows Task Scheduler common choice scheduling tasks Windows.Many GUI options easy specify, three error-prone, must specified carefully. exist “Actions” | “Start program”.Program/script: absolute path Rscript.exe. needs updated every time upgrade R (unless ’re something tricky PATH environmental OS variable). Notice using “patched” version R. entry enclosed quotes.\n\"C:\\Program Files\\R\\R-4.1.1patched\\bin\\Rscript.exe\"Program/script: absolute path Rscript.exe. needs updated every time upgrade R (unless ’re something tricky PATH environmental OS variable). 
Notice using “patched” version R. entry enclosed quotes.Add arguments (optional): specifies flow file run. case, repo ‘butcher-hearing-screen-1’ ’Documents/cdw/` directory; flow file located repo’s root directory, discussed prototypical repo. entry enclosed quotes.\n\"C:\\Users\\wbeasley\\Documents\\cdw\\butcher-hearing-screen-1\\flow.R\"Add arguments (optional): specifies flow file run. case, repo ‘butcher-hearing-screen-1’ ’Documents/cdw/` directory; flow file located repo’s root directory, discussed prototypical repo. entry enclosed quotes.Start (optional): sets working directory. properly set, relative paths files point correct locations. identical entry , () include ‘/flow.R’ (b) contains quotes.\nC:\\Users\\wbeasley\\Documents\\cdw\\butcher-hearing-screen-1Start (optional): sets working directory. properly set, relative paths files point correct locations. identical entry , () include ‘/flow.R’ (b) contains quotes.options typically specify :\nSelect “Run whether user logged .”\n\nConfigure highest available version Windows, using dropdown box.\n\n“Wake computer run task” probably necessary located normal desktop. something specify, tasks located VM-based workstation never turned .\nFollowing instructions, required enter password every time modify task, every time update password. using network credentials, probably specify account like “domain/username”. careful: modify task prompted password, GUI may subtly alter account entry just “username” (instead “domain”). Make sure prepend username domain, enter password.10+ tasks, consider creating System Environment Variable called %rscript_path% whose value something like \"C:\\Program Files\\R\\R-4.1.1patched\\bin\\Rscript.exe\". text %rscript_path% goes step one (“Program/script” ). R updated every months, need change path one place (.e., Environment Variables GUI) instead task, requires repeatedly re-entering username password. 
defined tasks differently describe , may need restart machine load fresh variable value Task Scheduler environment.code executed task scheduler accesses network drive file share, path naturally reference mapped letter. easiest solution spell full path. instance Python/R code, replace “Q:/subdirectory/hospital-location.csv” “//server-name/data-files/subdirectory/hospital-location.csv”.","code":"\"C:\\Program Files\\R\\R-4.1.1patched\\bin\\Rscript.exe\"\"C:\\Users\\wbeasley\\Documents\\cdw\\butcher-hearing-screen-1\\flow.R\"C:\\Users\\wbeasley\\Documents\\cdw\\butcher-hearing-screen-1"},{"path":"automation.html","id":"automation-sql-server-agent","chapter":"10 Automation & Reproducibility","heading":"10.2.3 SQL Server Agent","text":"SQL Server Agent executes jobs specified schedule. also naturally interfaces SSIS packages deployed server, can also execute formats, like plain sql file.important distinction runs service database server, opposed Task Scheduler, runs service client machine. prefer running jobs server job either:requires elevated/administrative privileges (instance, access sensitive data),require lot network constraints passing large amounts data server client, orfeels like server’s responsibility, rebuilding database index, archiving server logs.","code":""},{"path":"automation.html","id":"auxiliary-issues","chapter":"10 Automation & Reproducibility","heading":"10.3 Auxiliary Issues","text":"following subsections execute schedule code, considered.","code":""},{"path":"automation.html","id":"sink-log-files","chapter":"10 Automation & Reproducibility","heading":"10.3.1 Sink Log Files","text":"{Describe sink output file can examined easily.}","code":""},{"path":"automation.html","id":"package-versions","chapter":"10 Automation & Reproducibility","heading":"10.3.2 Package Versions","text":"project runs repeatedly schedule without human intervention, errors can easily go undetected simple systems. , error messages may clear running procedure RStudio. 
reasons, plan strategy maintaining version R packages. three approaches tradeoffs.\nconventional projects, keep packages date, live occasional breaks time. ’s time update packages week, (a) run daily reports morning, (b) update packages (R & RStudio necessary), (c) rerun reports, finally (d) verify results a & c . something different, day adapt pipeline code breaking changes packages.\nupdating package, read NEWS file changes backwards-compatible (commonly called “breaking changes” news file).\nchanges pipeline code difficult complete day, can roll back previous version remotes::install_version().\nside spectrum, can meticulously specify desired version R package. approach reduces chance new version package breaking existing pipeline code. recommend approach uptime important.\nintuitive implementation install explicit code file like utility/install-dependencies.R:\n\nremotes::install_version(\"dplyr\"     , version = \"0.4.3\" )\nremotes::install_version(\"ggplot2\"   , version = \"2.0.0\" )\nremotes::install_version(\"data.table\", version = \"1.10.4\")\nremotes::install_version(\"lubridate\" , version = \"1.6.0\" )\nremotes::install_version(\"openxlsx\"  , version = \"4.0.17\")\n# ... 
package list continues ...\nAnother implementation convert repo package , specify versions DESCRIPTION file.\nImports:\n   dplyr       (== 0.4.3 )\n   ggplot2     (== 2.0.0 )\n   data.table  (== 1.10.4)\n   lubridate   (== 1.6.0 )\n   openxlsx    (== 4.0.17)\ndownside can difficult set identical machine months. Sometimes packages depend package version incompatible package versions. example, one point, current version dplyr 0.4.3. months later, rlang package (wasn’t explicitly specified list 42 packages) required least version 0.8.0 dplyr. developer new machine needs decide whether upgrade dplyr (test breaking changes pipeline) install older version rlang.\nsecond important downside approach can lock user’s projects specific outdated package version.\nothers8 advocate approach team experienced R, machine dedicated important line--business workflow.\nuptime important team experienced languages like Java, Python, C#, consider better suited.side spectrum, can meticulously specify desired version R package. approach reduces chance new version package breaking existing pipeline code. recommend approach uptime important.intuitive implementation install explicit code file like utility/install-dependencies.R:Another implementation convert repo package , specify versions DESCRIPTION file.downside can difficult set identical machine months. Sometimes packages depend package version incompatible package versions. example, one point, current version dplyr 0.4.3. months later, rlang package (wasn’t explicitly specified list 42 packages) required least version 0.8.0 dplyr. 
developer new machine needs decide whether upgrade dplyr (test breaking changes pipeline) install older version rlang.\nsecond important downside approach can lock user’s projects specific outdated package version.\nothers8 advocate approach team experienced R, machine dedicated important line-of-business workflow.\nuptime important team experienced languages like Java, Python, C#, consider better suited.\ncompromise two previous approaches renv package - R Environmentals. successor packrat. requires learning cognitive overhead. investment becomes appealing (a) running hourly predictions downtime big deal, (b) machine contains multiple projects require different versions package (dplyr 0.4.3 dplyr 0.8.0).","code":"\nremotes::install_version(\"dplyr\"     , version = \"0.4.3\" )\nremotes::install_version(\"ggplot2\"   , version = \"2.0.0\" )\nremotes::install_version(\"data.table\", version = \"1.10.4\")\nremotes::install_version(\"lubridate\" , version = \"1.6.0\" )\nremotes::install_version(\"openxlsx\"  , version = \"4.0.17\")\n# ... 
package list continues ...Imports:\n   dplyr       (== 0.4.3 )\n   ggplot2     (== 2.0.0 )\n   data.table  (== 1.10.4)\n   lubridate   (== 1.6.0 )\n   openxlsx    (== 4.0.17)"},{"path":"scaling-up.html","id":"scaling-up","chapter":"11 Scaling Up","heading":"11 Scaling Up","text":"","code":""},{"path":"scaling-up.html","id":"data-storage","chapter":"11 Scaling Up","heading":"11.1 Data Storage","text":"Local File vs Conventional Database vs RedshiftUsage Cases","code":""},{"path":"scaling-up.html","id":"data-processing","chapter":"11 Scaling Up","heading":"11.2 Data Processing","text":"R vs SQLR vs Spark","code":""},{"path":"collaboration.html","id":"collaboration","chapter":"12 Parallel Collaboration","heading":"12 Parallel Collaboration","text":"","code":""},{"path":"collaboration.html","id":"social-contract","chapter":"12 Parallel Collaboration","heading":"12.1 Social Contract","text":"IssuesOrganized Commits & Coherent DiffsBranch & Merge Strategy","code":""},{"path":"collaboration.html","id":"code-reviews","chapter":"12 Parallel Collaboration","heading":"12.2 Code Reviews","text":"Daily Reviews PRsPeriodic Reviews Files","code":""},{"path":"collaboration.html","id":"remote","chapter":"12 Parallel Collaboration","heading":"12.3 Remote","text":"Headset & sharing screens","code":""},{"path":"collaboration.html","id":"additional-resources-1","chapter":"12 Parallel Collaboration","heading":"12.4 Additional Resources","text":"(Colin Gillespie 2017), particularly “Efficient collaboration” chapter.(Brian Fitzpatrick 2012)","code":""},{"path":"collaboration.html","id":"loose-notes","chapter":"12 Parallel Collaboration","heading":"12.5 Loose Notes","text":"","code":""},{"path":"collaboration.html","id":"github","chapter":"12 Parallel Collaboration","heading":"12.5.1 GitHub","text":"Review diffs committing. Check things like accidental deletions debugging code deleted (least commented ).Review diffs committing. 
Check things like accidental deletions debugging code deleted (least commented ).Keep chatter minimum, especially projects 3+ people notified every issue post.Keep chatter minimum, especially projects 3+ people notified every issue post.encountering problem,\nTake much ownership reasonable. Don’t merely report ’s error.\ncan’t figure , ask question describe well.\nlow-level file & line code threw error.\ntried solve .\n\n’s questionable line/chunk code, trace origin. sake pointing finger someone, sake understanding origin history.\nencountering problem,Take much ownership reasonable. Don’t merely report ’s error.can’t figure , ask question describe well.\nlow-level file & line code threw error.\ntried solve .\nlow-level file & line code threw error.tried solve .’s questionable line/chunk code, trace origin. sake pointing finger someone, sake understanding origin history.","code":""},{"path":"collaboration.html","id":"common-code","chapter":"12 Parallel Collaboration","heading":"12.5.2 Common Code","text":"involves code/files multiple people use, like REDCap arches.Run file committing . Run common downstream files (e.g., make change arch, also run funnel).upstream variable name must change, alert people. Post GitHub issue announce . 
Tell everyone, search repo (ctrl+shift+f RStudio) alert specific people might affected.","code":""},{"path":"document.html","id":"document","chapter":"13 Documentation","heading":"13 Documentation","text":"","code":""},{"path":"document.html","id":"team-wide","chapter":"13 Documentation","heading":"13.1 Team-wide","text":"","code":""},{"path":"document.html","id":"project-specific","chapter":"13 Documentation","heading":"13.2 Project-specific","text":"","code":""},{"path":"document.html","id":"dataset-origin-structure","chapter":"13 Documentation","heading":"13.3 Dataset Origin & Structure","text":"","code":""},{"path":"document.html","id":"document-issues","chapter":"13 Documentation","heading":"13.4 Issues & Tasks","text":"","code":""},{"path":"document.html","id":"documentation-issue-template","chapter":"13 Documentation","heading":"13.4.1 GitHub Issue Template","text":"going open repo/package public, consider creating template GitHub Issues ’s tailored repo’s unique characteristics. Furthermore, invite feedback user base improve template. appeal REDCapR produced Unexpected Behavior issue template:@nutterb @haozhu233, @rparrish, @sybandrew, one else, time, please look new issue template customized REDCapR/redcapAPI. ’d appreciate feedback improve experience someone encountering problem.’d like something () make easier user provide useful information less effort (b) make easier us help accurately fewer back--forths. template happens help user identify solve problem without creating issue …think everyone happier .think issue leverage Troubleshooter 10+ people contributed . help locate problematic area quickly.@haozhu233, seems ’ve liked template kableExtra. REDCapR different sense ’s difficult provide minimal & self-contained example reproduce problem. experience many users issues, ’d love advice.@nutterb, ’d like template helpful redcapAPI . three quick find--replace occurrences ‘REDCapR’ -> ‘redcapAPI’. 
mostly distinguish R package REDCap .","code":""},{"path":"document.html","id":"flow-diagrams","chapter":"13 Documentation","heading":"13.5 Flow Diagrams","text":"","code":""},{"path":"document.html","id":"document-workstation","chapter":"13 Documentation","heading":"13.6 Setting up new machine","text":"Thoroughly describe programs configuration settings team follow. Feel free adapt list needs.’ll see handful benefits:New hires productive sooner, able spend time conceptual issues instead walking tedious installation issues.New hires productive sooner, able spend time conceptual issues instead walking tedious installation issues.everyone team similar environment, easier share code. quality code hopefully improves everyone can leverage others contributions.everyone team similar environment, easier share code. quality code hopefully improves everyone can leverage others contributions.Sometimes department reluctant grant admin rights, especially new users. likely trust team installation documentation demonstrates thought carefully issues. Typically users just need programs like Office Adobe; may realize many tools used well-round data scientist.\nstill reluctant grant admin privileges, make sure realize () takes ~45 minutes install ~12 programs fresh machine, (b) many programs updated every months, (c) data scientist typically installs 5+ R packages month explore tools stay current field. Installing maintaining everyone’s workstation require significant amount time. team willing help alleviate burden maintain software.Sometimes department reluctant grant admin rights, especially new users. likely trust team installation documentation demonstrates thought carefully issues. 
Typically users just need programs like Office Adobe; may realize many tools used well-rounded data scientist.\nstill reluctant grant admin privileges, make sure realize (a) takes ~45 minutes install ~12 programs fresh machine, (b) many programs updated every months, (c) data scientist typically installs 5+ R packages month explore tools stay current field. Installing maintaining everyone’s workstation require significant amount time. team willing help alleviate burden maintain software.","code":""},{"path":"document.html","id":"document-mechanics","chapter":"13 Documentation","heading":"13.7 Documenting with Markdown in a GitHub Repo","text":"quick demo walks https://national-covid-cohort-collaborative.github.io/book--n3c-v1/\nSelect correct file repo.","code":""},{"path":"style.html","id":"style","chapter":"14 Style Guide","heading":"14 Style Guide","text":"Using consistent style across projects can increase overhead data science team discusses options, decides good choice, develops compliant code. like themes document, cost worth effort. Unforced code errors reduced code consistent, mistake-prone styles apparent.\npart, team follows tidyverse style. additional conventions attempt follow. Many inspired (Francesco Balena 2005).","code":""},{"path":"style.html","id":"readability","chapter":"14 Style Guide","heading":"14.1 Readability","text":"","code":""},{"path":"style.html","id":"style-number","chapter":"14 Style Guide","heading":"14.1.1 Number","text":"word “number” ambiguous, especially data science. Try specific terms:\ncount: number discrete objects events, visit_count, pt_count, dx_count.\nid: value uniquely identifies entity doesn’t change time, pt_id, clinic_id, client_id.\nindex: 1-based sequence ’s typically temporary, unique within dataset. instance, pt_index 195 Tuesday’s dataset likely different person pt_index 195 Wednesday. given day, one value 195.\ntag: persistent across time like “id”, typically created analysts send research team. 
See snippet appendix example.\ntally: running count\nduration: length time. Specify units self-evident.\nphysical statistical quantities like\n“depth”,\n“length”,\n“mass”,\n“mean”, \n“sum”.","code":""},{"path":"style.html","id":"style-abbreviation","chapter":"14 Style Guide","heading":"14.1.2 Abbreviations","text":"Try avoid abbreviations. Different people tend shorten words differently; variability increases chance people reference wrong variable. least, wastes time trying remember subject_number, subject_num, subject_no used. Consistency section describes can reduce errors increase efficiency.\nHowever, terms long reasonably use without shortening. make exceptions, following scenarios:\nhumans commonly use term orally. instance, people tend say “OR” instead “operating room”.\nteam agreed set list abbreviations. list CDW team includes:\nappt (not “apt”),\ncdw,\ncpt,\ndrg (stands diagnosis-related group),\ndx,\nhx,\nicd,\npt, \nvr (vital records).\nteam choose terms (e.g., ‘apt’ vs ‘appt’), try use standard vocabulary, MedTerms Medical Dictionary.","code":""},{"path":"style.html","id":"style-datasets","chapter":"14 Style Guide","heading":"14.2 Datasets","text":"","code":""},{"path":"style.html","id":"style-datasets-filter","chapter":"14 Style Guide","heading":"14.2.1 Filtering Rows","text":"Removing datasets rows important operation frequent source sneaky errors. practices reduce mistakes improve maintainability.","code":""},{"path":"style.html","id":"style-datasets-filter-number-line","chapter":"14 Style Guide","heading":"14.2.1.1 Mimic number line","text":"ordering quantities, go smallest-to-largest type left-to-right. minimum consistent direction. words, use operators like < <= avoid > >=. 
approach also makes consistent SQL dplyr function, between().","code":"\n# Good (b/c quantities increase as you read left-to-right)\nds_teenager |>\n  dplyr::filter(13 <= age & age < 20)\n\n# Not as good (b/c quantities increase as you read right-to-left)\nds_teenager |>\n  dplyr::filter(20 > age & age >= 13)\n\n# Bad (b/c the order is inconsistent)\nds_teenager |>\n  dplyr::filter(age >= 13 & age < 20)\nds_teenager |>\n  dplyr::filter(age < 20 & age >= 13)"},{"path":"style.html","id":"style-datasets-filter-searchable","chapter":"14 Style Guide","heading":"14.2.1.2 Searchable verbs","text":"’ve occasionally asked frustration, “dataset lose rows? 900 rows middle script, now 782.” scan script location potentially removes rows. locations easier identify ’re scanning small set filtering functions \ntidyr::drop_na(),\ndplyr::filter(), \ndplyr::summarize(). can even highlight ‘ctrl+f’. contrast, base R’s filtering style difficult identify.","code":"\n# tidyverse's approach is easy to see in a long script\nds <-\n  ds |>\n  dplyr::filter(4 <= count)\n  \n# base R's approach is harder to see\nds <- ds[4 <= ds$count, ]"},{"path":"style.html","id":"style-datasets-filter-drop_na","chapter":"14 Style Guide","heading":"14.2.1.3 Remove rows with missing values","text":"Even within tidyverse functions, preferences certain scenarios. entry covers scenario dropping entire row important column missing value.\ntidyr::drop_na() removes rows missing value specific column. cleaner read write dplyr’s filter() base R’s subsetting bracket. 
particular, ’s easy forget/overlook !.","code":"\n# Cleanest\nds |>\n  tidyr::drop_na(dob)\n\n# Not as good\nds |>\n  dplyr::filter(!is.na(dob))\n\n# Ripest for mistakes or misinterpretation\nds[!is.na(ds$dob), ]"},{"path":"style.html","id":"style-datasets-attach","chapter":"14 Style Guide","heading":"14.2.2 Don’t attach","text":"Google Stylesheet says, “possibilities creating errors using attach() numerous.”\nHopefully ’ve learned R recently enough haven’t read examples 1990s used attach(). may made sense early days S-PLUS language used primarily interactively single statistician. contemporary tradeoffs unfavorable, now R scripts frequently run multiple people functions run multiple contexts.","code":""},{"path":"style.html","id":"style-factor","chapter":"14 Style Guide","heading":"14.3 Categorical Variables","text":"lots names categorical variable across different disciplines (e.g., factor, categorical, …).","code":""},{"path":"style.html","id":"style-factor-unknown","chapter":"14 Style Guide","heading":"14.3.1 Explicit Missing Values","text":"Define level like \"unknown\" data manipulation doesn’t test is.na(x) x == \"unknown\". explicit label also helps included statistical procedure coefficient table.","code":""},{"path":"style.html","id":"style-factor-granularity","chapter":"14 Style Guide","heading":"14.3.2 Granularity","text":"Sometimes helps represent values differently, say granular variable coarse variable. two related variables 7 3 levels respectively, say *_cut7 *_cut3 denote resolution; related base::cut(). Don’t forget include “unknown” “” necessary.\ndplyr::recode_factor() ideal replacement scenario , single call combines work dplyr::recode() base::factor(). 
Just make sure recoding order represents desired order factor levels.","code":"# Inside a dplyr::mutate() clause\neducation_cut7      = dplyr::recode(\n  education_cut7,\n  \"No Highschool Degree / GED\"  = \"no diploma\",\n  \"High School Degree / GED\"    = \"diploma\",\n  \"Some College\"                = \"some college\",\n  \"Associate's Degree\"          = \"associate\",\n  \"Bachelor's Degree\"           = \"bachelor\",\n  \"Post-graduate degree\"        = \"post-grad\",\n  \"Unknown\"                     = \"unknown\",\n  .missing                      = \"unknown\",\n),\neducation_cut3      = dplyr::recode(\n  education_cut7,\n  \"no diploma\"    = \"no bachelor\",\n  \"diploma\"       = \"no bachelor\",\n  \"some college\"  = \"no bachelor\",\n  \"associate\"     = \"no bachelor\",\n  \"bachelor\"      = \"bachelor\",\n  \"post-grad\"     = \"bachelor\",\n  \"unknown\"       = \"unknown\",\n),\neducation_cut7 = factor(education_cut7, levels=c(\n  \"no diploma\",\n  \"diploma\",\n  \"some college\",\n  \"associate\",\n  \"bachelor\",\n  \"post-grad\",\n  \"unknown\"\n)),\neducation_cut3 = factor(education_cut3, levels=c(\n  \"no bachelor\",\n  \"bachelor\",\n  \"unknown\"\n)),# Inside a dplyr::mutate() clause\neducation_cut7      = dplyr::recode_factor(\n  education_cut7,\n  \"No Highschool Degree / GED\"  = \"no diploma\",\n  \"High School Degree / GED\"    = \"diploma\",\n  \"Some College\"                = \"some college\",\n  \"Associate's Degree\"          = \"associate\",\n  \"Bachelor's Degree\"           = \"bachelor\",\n  \"Post-graduate degree\"        = \"post-grad\",\n  \"Unknown\"                     = \"unknown\",\n  .missing                      = \"unknown\",\n),\neducation_cut3      = dplyr::recode_factor(\n  education_cut7,\n  \"no diploma\"    = \"no bachelor\",\n  \"diploma\"       = \"no bachelor\",\n  \"some college\"  = \"no bachelor\",\n  \"associate\"     = \"no bachelor\",\n  \"bachelor\"      = \"bachelor\",\n  \"post-grad\"     = 
\"bachelor\",\n  \"unknown\"       = \"unknown\",\n),"},{"path":"style.html","id":"style-dates","chapter":"14 Style Guide","heading":"14.4 Dates","text":"Date arithmetic hard. Naming dates well might harder.birth_month_index can values 1 12, birth_month (commonly mob) contains year (e.g., 2014-07-15).birth_month_index can values 1 12, birth_month (commonly mob) contains year (e.g., 2014-07-15).birth_year integer, birth_month birth_week dates. Typically months collapsed 15th day weeks collapsed Monday, defaults OuhscMunge::clump_month_date() OuhscMunge::clump_week_date(). obfuscate real value PHI involved. Months centered midpoint usually better representation month’s performance month’s initial day.birth_year integer, birth_month birth_week dates. Typically months collapsed 15th day weeks collapsed Monday, defaults OuhscMunge::clump_month_date() OuhscMunge::clump_week_date(). obfuscate real value PHI involved. Months centered midpoint usually better representation month’s performance month’s initial day.Don’t use minus operator (.e., -). See Defensive Date Arithmetic.Don’t use minus operator (.e., -). See Defensive Date Arithmetic.","code":""},{"path":"style.html","id":"style-naming","chapter":"14 Style Guide","heading":"14.5 Naming","text":"","code":""},{"path":"style.html","id":"style-naming-variables","chapter":"14 Style Guide","heading":"14.5.1 Variables","text":"builds upon tidyverse style guide objects.","code":""},{"path":"style.html","id":"style-naming-variables-characters","chapter":"14 Style Guide","heading":"14.5.1.1 Characters","text":"Use lowercase letters, using underscores separate words. Avoid uppercase letters periods.","code":""},{"path":"style.html","id":"style-naming-semantic","chapter":"14 Style Guide","heading":"14.5.2 Semantic Order","text":"variables including multiple nouns adjectives, place global terms microscopic terms. 
“bigger” term goes first; “smaller” terms successively nested bigger terms.Large datasets multiple questionnaires (multiple subsections) much manageable variables follow semantic order.don’t know picked term “semantic order”. may come Semantic Versioning software releases.","code":"\n# Good:\nparent_name_last\nparent_name_first\nparent_dob\nkid_name_last\nkid_name_first\nkid_dob\n\n# Bad:\nlast_name_parent\nfirst_name_parent\ndob_parent\nlast_name_kid\nfirst_name_kid\ndob_kid\n\nSELECT\n  asq3_medical_problems_01\n  ,asq3_medical_problems_02\n  ,asq3_medical_problems_03\n  ,asq3_behavior_concerns_01\n  ,asq3_behavior_concerns_02\n  ,asq3_behavior_concerns_03\n  ,asq3_worry_01\n  ,asq3_worry_02\n  ,asq3_worry_03\n  ,wai_01_steps_beneficial\n  ,wai_02_hv_useful\n  ,wai_03_parent_likes_me\n  ,wai_04_hv_doubts\n  ,hri_01_client_input\n  ,hri_02_problems_discussed\n  ,hri_03_addressing_problems_clarity\n  ,hri_04_goals_discussed\nFROM miechv.gpav_3"},{"path":"style.html","id":"style-naming-files","chapter":"14 Style Guide","heading":"14.5.3 Files and Folders","text":"Naming files folders/directories follows style naming variables, one small difference: separate words dashes (.e., -), underscores (.e., _). words, “kebab case” instead “snake case”.Occasionally, ’ll use dash helps identify noun (already contains underscore). instance, ’s table called patient_demographics, might call files patient_demographics-truncate.sql patient_demographics-insert.sql.Using lower case important databases operating systems case-sensitive, case-insensitive. promote portability, keep everything lowercase., file folder names contain () lowercase letters, (b) digits, (c) dashes, (d) occasional underscore. 
include spaces, uppercase letters, especially punctuation, : (.","code":""},{"path":"style.html","id":"style-naming-datasets","chapter":"14 Style Guide","heading":"14.5.4 Datasets","text":"tibbles (fancy data.frames) used almost every analysis file, put extra effort formulating conventions informative consistent. Naming datasets follows style naming variables, additional features.R world, “dataset” typically synonym data.frame –rectangular structure rows columns. database equivalent conventional table. Note “dataset” means collections tables .NET world, collection (-necessarily-rectangular) files Dataverse.9","code":""},{"path":"style.html","id":"style-naming-datasets-prefix","chapter":"14 Style Guide","heading":"14.5.4.1 Prefix with ds_ and d_","text":"Datasets handled differently variables find ’s easier identify type scope. prefix ds_ indicates dataset available entire file, d_ indicates scope localized function.","code":"\ncount_elements <- function (d) {\n  nrow(d) * ncol(d)\n}\n\nds <- mtcars\ncount_elements(d = ds)"},{"path":"style.html","id":"style-naming-datasets-grain","chapter":"14 Style Guide","heading":"14.5.4.2 Express the grain","text":"grain dataset describes row represents, similar idea statistician’s concept “unit analysis”. Essentially granular entity described. Many miscommunications silly mistakes avoided team disciplined enough define tidy dataset clear grain.insight grains, Ralph Kimball writesIn debugging literally thousands dimensional designs students years, found frequent design error far declaring grain fact table beginning design process. grain isn’t clearly defined, whole design rests quicksand. Discussions candidate dimensions go around circles, rogue facts introduce application errors sneak design.\n…\nhope ’ve noticed powerful effects declaring grain. First, can visualize dimensionality doctor bill line item precisely, can therefore confidently examine data sources, deciding whether dimension can attached data. 
example, probably exclude “treatment outcome” example medical billing data doesn’t tie notion outcome.","code":"\nds_student          # One row per student\nds_teacher          # One row per teacher\nds_course           # One row per course\nds_course_student   # One row per student-course combination\nds_pt         # One row per patient\nds_pt_visit   # One row per patient-visit combination\nds_visit      # Same as above, since it's clear a visit is connected w/ a pt"},{"path":"style.html","id":"style-naming-datasets-singular","chapter":"14 Style Guide","heading":"14.5.4.3 Singular table names","text":"adopt style table’s name reflects grain, corollary. grain singular like “one row per client” “one row per building”, name ds_client ds_building (ds_clients ds_buildings). datasets saved database, tables called client building.Table names plural grain plural. record field like client_id, date_birth, date_graduation date_death, suggest called table client_milestones (single row contains three milestones).Stack Overflow post presents variety opinions justifications adopting singular plural naming scheme.think ’s acceptable R vectors follow different style R data.frames. instance, vector can plural name even though element singular (e.g., client_ids <- c(10, 24, 25)).","code":""},{"path":"style.html","id":"style-naming-datasets-ds-only","chapter":"14 Style Guide","heading":"14.5.4.4 Use ds when definition is clear","text":"Many times ellis file handles one incoming csv outgoing dataset, grain obvious –typically ellis filename clearly states grain.case, R script can use just ds instead ds_county.","code":""},{"path":"style.html","id":"style-naming-datasets-adjective","chapter":"14 Style Guide","heading":"14.5.4.5 Use an adjective after the grain, if necessary","text":"R file manipulating two datasets grain, qualify differences grain, ds_client_all ds_client_michigan. 
Adjectives commonly indicate one dataset subset another.occasional limitation naming scheme difficult distinguish grain adjective. instance, grain ds_student_enroll either () every instance student enrollment (.e., student enroll describe grain) (b) subset students enrolled (.e., student grain enroll adjective)? ’s clear without examine code, comments, documentation.someone solution, love hear . far, ’ve reluctant decorate variable name , ds_grain_client_adj_enroll.","code":""},{"path":"style.html","id":"style-naming-datasets-define","chapter":"14 Style Guide","heading":"14.5.4.6 Define the dataset when in doubt","text":"’s potentially unclear new reader, use comment immediately dataset’s initial use. grain frequently important characteristic document.","code":"\n# `ds_client_enroll`:\n#    grain: one row per client\n#    subset: only clients who have successfully enrolled are included\n#    source: the `client` database table, where `enroll_count` is 1+.\nds_client_enroll <- ..."},{"path":"style.html","id":"style-whitespace","chapter":"14 Style Guide","heading":"14.6 Whitespace","text":"Although execution rarely affected whitespace R SQL files, consistent minimalistic. One benefit Git diffs won’t show unnecessary churn. line code lights diff, ’s nice reflect real change, something trivial like tabs converted spaces, trailing spaces added deleted.guidelines handled automatically modern IDEs, configure correct settings.Tabs replaced spaces. modern IDEs option automatically. (RStudio calls “Insert spaces tabs”.)Indentions replaced consistent number spaces, depending file type.\nR: 2 spaces\nSQL: 2 spaces\nPython: 4 spaces\nEach file end blank line. 
(RStudio checkbox “Ensure source files end newline.”)Remove spaces tabs end lines.\nVS Code: see VS Code section Workstation chapter.\nAzure Data Studio: See ADS section Workstation chapter.\nRStudio: Global Options | Code | Saving | Strip trailing horizontal whitespace saving.\nSSMS:","code":""},{"path":"style.html","id":"style-database","chapter":"14 Style Guide","heading":"14.7 Database","text":"GitLab’s data team good style guide databases sql ’s fairly consistent style. important additions differences areFavor CTEs subqueries ’re easier follow can reused file. performance problem, slightly rewrite CTE temp table see new indexes help.\nResources:\nBrent Ozar’s SQL Server Common Table Expressions defines basics:\n\nCTE effectively creates temporary view developer can reference multiple times underlying query.\n\nBrent Ozar’s ’s Better, CTEs Temp Tables? article’s bottom line :\n\n’d suggest starting CTEs ’re easy write read. hit performance wall, try ripping CTE writing temp table, joining temp table.\n\nname primary key typically contain table. employee table, key employee_id, id.","code":""},{"path":"style.html","id":"style-repo","chapter":"14 Style Guide","heading":"14.8 Code Repositories","text":"analytical team dedicates private repo research project. repository GitHub accessible team members granted explicit privileges. Repos also discussed Git & GitHub appendix.","code":""},{"path":"style.html","id":"style-repo-naming","chapter":"14 Style Guide","heading":"14.8.1 Repo Naming","text":"2022, GitHub organization 300 repos. Many focused warehouse projects completed within month. easiest stable naming system ’ve found built three parts:PI’s last name. Even contact project manager, prefer use primary investigator’s name (typically name IRB application) rarely changes easier trace right team. refer medical resident fellow rotate months.Two three word term. Describe global area words.Index. optimistic prepare follow investigations. initial repo “…-1”, subsequent repos “…-2, …-3, …-4”.informally call “project tag” try use consistently different arenas, :GitHub repo’s name.parent directory project file server (e.g., M:/pediatrics/bbmc/akande-covid-1).database schema containing project’s tables (e.g., akande_covid_1.patient, akande_covid_1.visit, akande_covid_2.visit). 
Change kebab case snake case (e.g., akande-covid-1 akande_covid_1) sql code doesn’t escape schema name brackets.body emails help retrospective searches.","code":"\n# Good Examples\nakande-asthma-hospitalization-1\nakande-asthma-hospitalization-2\nakande-covid-1\nbard-covid-1\nbard-covid-2\nbard-eeg-education-1\n\n# Bad Examples\nakande-1\nakande-2\ncovid-1\ncovid-2\ncovid-3\nbard-research-1"},{"path":"style.html","id":"style-repo-granularity","chapter":"14 Style Guide","heading":"14.8.2 Repo Granularity","text":"boundaries research project may fuzzy, may clear answer question, “considered one large research project one repo, two smaller research projects two total repos?”. deciding factor us usually determined amount living code need exist repos. two projects developed parallel make similar changes repos, strongly consider using one repo.issues suggest unified repo:two repos almost identical users.two repos covered IRB.Issues suggest separate repos:development windows don’t overlap. initial project wrapped last year follow-study starting, consider separate repo starts subset code. Start fresh copy ’s necessary","code":""},{"path":"style.html","id":"style-repo-pricing","chapter":"14 Style Guide","heading":"14.8.3 Repo Pricing","text":"enrolled GitHub program 2012 allows academic research group unlimited private repos GitHub Organization. Otherwise, feasible 300+ tightly-focused repos.GitHub seems introduce new programs modify existing branding every years. current best documentation “Apply educator researcher discount”. Notice program lightweight program like “GitHub Campus”, involves whole campus apparently.","code":""},{"path":"style.html","id":"style-ggplot","chapter":"14 Style Guide","heading":"14.9 ggplot2","text":"expressiveness ggplot2 allows someone quickly develop precise scientific graphics. One graph can specified many equivalent styles, increases opportunity confusion. 
formalized much style writing textbook introductory statistics (Lise DeShea (2015)); 200+ graphs code publicly available.additional ggplot2 tips tidyverse style guide.","code":""},{"path":"style.html","id":"style-ggplot-order","chapter":"14 Style Guide","heading":"14.9.1 Order of commands","text":"ggplot2 essentially collection functions combined + operator. Publication graphs common require least 20 functions, means functions can sometimes redundant step toes. family functions follow consistent order ideally starting important structural functions ending cosmetic functions. preference :ggplot() primary function specify default dataset aesthetic mappings. Many arguments can passed aes(), prefer follow order consistent scale_*() order .geom_*() annotate() creates geometric elements represent data. Unlike categories list, order matters. Geoms specified first drawn first, therefore can obscured subsequent geoms.scale_*() describes dimension data (specified aes()) translated visual element. 
specify dimensions descending order (typical) importance: x, y, group, color, fill, size, radius, alpha, shape, linetype.coord_*()facet_*() label_*()guides()theme() (call ‘big’ themes like theme_minimal() overriding details like theme(panel.grid = element_line(color = \"gray\")))labs()graph contains typical ggplot2 elements.","code":"ggplot(ds, aes(x = group, y = lift_count, fill = group, color = group)) +\n  geom_bar(stat = \"summary\", fun = \"mean\", color = NA) +\n  geom_point(position = position_jitter(w = 0.4, h = 0), shape = 21) +\n  scale_color_manual(values = palette_pregnancy_dark) +\n  scale_fill_manual( values = palette_pregnancy_light) +\n  coord_flip() +\n  facet_wrap(\"time\") +\n  theme_minimal() +\n  theme(legend.position = \"none\") +\n  theme(panel.grid.major.y = element_blank()) +\n  labs(\n    title = \"Lifting by Group across Time\",\n    x     = NULL,\n    y     = \"Number of Lifts\"\n  )"},{"path":"style.html","id":"style-ggplot-gotchas","chapter":"14 Style Guide","heading":"14.9.2 Gotchas","text":"common mistakes see --infrequently (even sometimes ggplot2 code).","code":""},{"path":"style.html","id":"style-ggplot-zoom","chapter":"14 Style Guide","heading":"14.9.2.1 Zooming","text":"Call coord_*() restrict plotted x/y values, scale_*() lims()/xlim()/ylim(). coord_*() zooms axes, extreme values essentially fall page; contrast, latter three functions essentially remove values dataset. distinction matter simple bivariate scatterplot, likely mislead viewer two common scenarios. First, call geom_smooth() (e.g., overlays loess regression curve) ignore extreme values entirely; consequently summary location misplaced standard errors tight. 
Second, line graph spaghetti plots contains extreme value, sometimes desirable zoom primary area activity; calling coord_*(), trend line leave return plotting panel (implies points exist fit page), yet calling others, trend line appear interrupted, extreme point missing value.","code":""},{"path":"style.html","id":"style-ggplot-seed","chapter":"14 Style Guide","heading":"14.9.2.2 Seed","text":"jittering, set seed ‘declare-globals’ chunk rerunning report won’t create (slightly) different png. insignificantly different pngs consume extra space Git repository. Also, GitHub diff show difference png versions, requires extra subjectivity cognitive load determine difference due solely jittering, something really changed analysis.Occasionally ’ll want multiple graphs report consistent jitter, set seed prior ggplot() call. Lise DeShea’s 2015 book, Figures 3-21, 3-22, 3-23 needed similar possible inter-graph differences easier distinguish.","code":"\n# ---- declare-globals ---------------------------------------------------------\nset.seed(seed = 789) # Set a seed so the jittered graphs are consistent across renders.\n# ---- figure-03-21 ------------------------------------------------------\nset.seed(seed = 789)\nggplot(ds, aes(x = group, y = t1_lifts, fill = group)) +\n...\n\n# ---- figure-03-22 ------------------------------------------------------\nset.seed(seed = 789)\nggplot(ds, aes(x = group, y = t1_lifts, fill = group)) +\n...\n\n# ---- figure-03-23 ------------------------------------------------------\nset.seed(seed = 789)\nggplot(ds, aes(x = group, y = t1_lifts, fill = group)) +\n..."},{"path":"publication.html","id":"publication","chapter":"15 Publishing Results","heading":"15 Publishing Results","text":"","code":""},{"path":"publication.html","id":"publication-analysts","chapter":"15 Publishing Results","heading":"15.1 To Other Analysts","text":"","code":""},{"path":"publication.html","id":"publication-experts","chapter":"15 Publishing Results","heading":"15.2 
To Researchers & Content Experts","text":"","code":""},{"path":"publication.html","id":"publication-phobic","chapter":"15 Publishing Results","heading":"15.3 To Technical-Phobic Audiences","text":"","code":""},{"path":"validation.html","id":"validation","chapter":"16 Validation","heading":"16 Validation","text":"","code":""},{"path":"validation.html","id":"validation-intro","chapter":"16 Validation","heading":"16.1 Intro","text":"learn tools efficiently generate informative descriptive reports, time invest almost always pays .Validating dataset serves many beneficial roles, includingexploring basic descriptive patterns,verifying understand variable’s definition,communicating team already understand,describing variation locations time periods,evaluating preliminary hypotheses, andassessing likelihood assumptions inferential models reasonable.","code":""},{"path":"validation.html","id":"validation-ad-hoc","chapter":"16 Validation","heading":"16.2 Ad-hoc Manual Inspections","text":"recommend starting basic question developing quick dirty report addresses immediate need. initial curiosity satisfied, consider report can evolve address future needs. One common evolutionary path report inform inferential model. second common path assimilated automated report frequently run.","code":""},{"path":"validation.html","id":"validation-inferential","chapter":"16 Validation","heading":"16.3 Inferential Support","text":"","code":""},{"path":"validation.html","id":"validation-inferential-background","chapter":"16 Validation","heading":"16.3.1 Brief Intro to Inferential Statistics","text":"Descriptive statistics differ inferential statistics. descriptive statistic concerns observed elements sample, average height range weakest strongest systolic blood pressure. fuzziness forecasting descriptive statistic –’s simply straight-forward equation observed points.10An inferential statistic tries reach beyond descriptive statistic: projects beyond observed sample. 
assesses pattern within collected sample likely exist larger population. Suppose group 40 newborns tended faster heart rates 33 infants. Stated differently, average 40 newborns faster average 33 infants. large Student t (accompanying small p-value) may indicate difference exists among babies –just among 73. (Notice ’re comparing average two groups, saying slowest newborn still faster fastest infant)However order conclusions valid, several assumptions must met. See (Lise DeShea 2015) information t-test analyses commonly used health care.sense, t-test resembles broad category inferential statistics: validity assumptions can evaluated research design (e.g., kid measured independently), assumptions best evaluated data (e.g., residuals/errors follow approximate bell-shaped distribution).graphs useful assessing appropriateness inferential statistic:beginners: histogramsfor beginners: scatterplot observedfor beginners: plots residuals (.e., discrepancy point’s observed & predicted value)advanced users, see suite graphs built base RIn words, can help establish foundation justifies inferential statistic.important … comfortable inferential statistic reasonably meet assumptions conclusions valid.","code":""},{"path":"validation.html","id":"automated-reports","chapter":"16 Validation","heading":"16.4 Automated Reports","text":"two strategies (ad-hoc inspections inferential support) can connected. ad-hoc inspection enlightening, consider spending ~15 minutes making report easily reproducible things change. 
reasons report monitored repeatedly changes inTemporal Trends (e.g., dataset Jan 2020 Dec 2020 looks different Jan 2020 Dec 2022)Inclusion criteria (e.g., restrict list diagnosis code)Data Partner sites (e.g., new site contributes data patterns didn’t anticipate)","code":""},{"path":"testing.html","id":"testing","chapter":"17 Testing","heading":"17 Testing","text":"","code":""},{"path":"testing.html","id":"testing-functions","chapter":"17 Testing","heading":"17.1 Testing Functions","text":"","code":""},{"path":"testing.html","id":"validator","chapter":"17 Testing","heading":"17.2 Validator","text":"Benefits AnalystsBenefits Data Collectors","code":""},{"path":"troubleshooting.html","id":"troubleshooting","chapter":"18 Troubleshooting and Debugging","heading":"18 Troubleshooting and Debugging","text":"","code":""},{"path":"troubleshooting.html","id":"finding-help","chapter":"18 Troubleshooting and Debugging","heading":"18.1 Finding Help","text":"Within group (eg, Thomas REDCap questions)Within university (eg, SCUG)Outside (eg, Stack Overflow; GitHub issues)","code":""},{"path":"troubleshooting.html","id":"debugging","chapter":"18 Troubleshooting and Debugging","heading":"18.2 Debugging","text":"traceback(), browser(), etc","code":""},{"path":"workstation.html","id":"workstation","chapter":"19 Workstation","heading":"19 Workstation","text":"believe important keep software updated consistent across workstations project. material originally posted https://github.com/OuhscBbmc/RedcapExamplesAndPatterns/blob/main/DocumentationGlobal/ResourcesInstallation.md. help establish tools new development computer.","code":""},{"path":"workstation.html","id":"workstation-required","chapter":"19 Workstation","heading":"19.1 Required Installation","text":"installation order matters.","code":""},{"path":"workstation.html","id":"workstation-r","chapter":"19 Workstation","heading":"19.1.1 R","text":"R centerpiece analysis. Every months, ’ll need download recent version. 
{added Sept 2012}","code":""},{"path":"workstation.html","id":"workstation-rstudio","chapter":"19 Workstation","heading":"19.1.2 RStudio","text":"RStudio Desktop IDE (integrated development environment) ’ll use interact R, GitHub, Markdown. Updates can checked easily menus Help -> Check Updates. {added Sept 2012}Note: non-default changes facilitate workflow. Choose “Global Options” “Tools” menu bar.General | Basic | Restore .RData workspace startup: uncheckedGeneral | Basic | Save workspace .RData exit: neverGeneral | Basic | Always save history: uncheckedCode | Editing | Use native pipe operator, |>: checkedCode | Saving | Ensure source files end newline: checkedCode | Saving | Strip trailing horizontal whitespace saving: checkedSweave | Weave Rnw file using: knitr","code":""},{"path":"workstation.html","id":"workstation-rtools","chapter":"19 Workstation","heading":"19.1.3 R Tools","text":"R Tools Windows necessary build packages development hosted GitHub. running Windows, follow page’s instructions, especially “Putting Rtools PATH” section. running Linux, components R Tools likely already installed machine. {added Feb 2017}","code":""},{"path":"workstation.html","id":"workstation-r-package-installation","chapter":"19 Workstation","heading":"19.1.4 Installing R Packages","text":"Dozens R Packages need installed. Choose one two related scripts. install list packages data analysts typically need. script installs package ’s already installed; also existing package updated newer version available. Create new ‘personal library’ prompts . takes least fifteen minutes, start go lunch. list packages evolve time, please help keep list updated.install frequently-used packages, run following snippet. first lines installs important package. second line calls online Gist11, defines package_janitor_remote() function. function installs packages listed two CSVs, package-dependency-list.csv package-dependency-list-.csv.projects require specialized packages typically used. 
cases, develop git repo R package includes proper DESCRIPTION file. See RAnalysisSkeleton example.project opened RStudio, update_packages_addin() OuhscMunge find DESCRIPTION file install package dependencies.","code":"\nif (!base::requireNamespace(\"devtools\")) utils::install.packages(\"devtools\")\ndevtools::source_gist(\"2c5e7459b88ec28b9e8fa0c695b15ee3\", filename=\"package-janitor-bbmc.R\")\n\n# Important packages required by most BBMC projects\npackage_janitor_remote(\n  \"https://raw.githubusercontent.com/OuhscBbmc/RedcapExamplesAndPatterns/main/utility/package-dependency-list.csv\"\n)\n\n# Nonessential packages used in a few BBMC projects\npackage_janitor_remote(\n  \"https://raw.githubusercontent.com/OuhscBbmc/RedcapExamplesAndPatterns/main/utility/package-dependency-list-more.csv\"\n)\nif( !base::requireNamespace(\"remotes\"   ) ) utils::install.packages(\"remotes\")\nif( !base::requireNamespace(\"OuhscMunge\") ) remotes::install_github(\"OuhscBbmc/OuhscMunge\")\nOuhscMunge::update_packages_addin()"},{"path":"workstation.html","id":"workstation-r-package-update","chapter":"19 Workstation","heading":"19.1.5 Updating R Packages","text":"Several R packages need updated every weeks. Unless told (break something -rare), periodically update packages executing following code update.packages(checkBuilt = TRUE, ask = FALSE).","code":""},{"path":"workstation.html","id":"workstation-github","chapter":"19 Workstation","heading":"19.1.6 GitHub","text":"GitHub registration necessary push modified files repository. First, register free user account, tell repository owner exact username, add collaborator (e.g., https://github.com/OuhscBbmc/RedcapExamplesAndPatterns). {added Sept 2012}","code":""},{"path":"workstation.html","id":"workstation-github-client","chapter":"19 Workstation","heading":"19.1.7 GitHub Desktop","text":"GitHub Desktop basic tasks little easier git features built RStudio. client available Windows macOS. 
(Occasionally, someone might need use git command line fix problems, required start.) {added Sept 2012}","code":""},{"path":"workstation.html","id":"workstation-recommended","chapter":"19 Workstation","heading":"19.2 Recommended Installation","text":"installation order matter.","code":""},{"path":"workstation.html","id":"workstation-odbc","chapter":"19 Workstation","heading":"19.2.1 ODBC Driver","text":"ODBC Driver SQL Server connecting token server, institution using one. writing, version 18 recent driver version. See new one exists. {updated Feb 2022}","code":""},{"path":"workstation.html","id":"workstation-quarto","chapter":"19 Workstation","heading":"19.2.2 Quarto","text":"Quarto Posit’s/RStudio’s successor knitr. uses embedded version Pandoc translate R/Python/Julia code html pdf reports (via Markdown). Reporting reproducible research foundation workflow Quarto used upcoming generation reports. existing Rmd file delivering need (something like article federal report), continue using knitr R Markdown. developing new report scratch, strongly consider Quarto. {added Nov 2022}Quarto’s Get Started page instructions. ’ll want installed RStudio IDE, probably VS Code . See troubleshooting tips necessary.","code":""},{"path":"workstation.html","id":"workstation-notepadpp","chapter":"19 Workstation","heading":"19.2.3 Notepad++","text":"Notepad++ text editor allows look raw text files, code CSVs. CSVs data files, helpful troubleshooting (instead looking file Excel, masks & causes issues). 
{added Sept 2012}","code":""},{"path":"workstation.html","id":"workstation-ads","chapter":"19 Workstation","heading":"19.2.4 Azure Data Studio","text":"Azure Data Studio (ADS) now recommended Microsoft others analysts (roles) –ahead SQL Server Management Studio. Note: non-default changes facilitate workflow. Settings | Text Editor | Tab Size: 2 {\"editor.tabSize\": 2} Settings | Text Editor | Detect Indentation: uncheck {\"editor.detectIndentation\": false} Settings | Text Editor | Insert Final Newlines: check {\"files.insertFinalNewline\": true} Settings | Text Editor | Trim Final Newlines: check {\"files.trimFinalNewlines\": true} Settings | Text Editor | Trim Trailing Whitespace: check {\"files.trimTrailingWhitespace\": true} Data | Sql | Show Connection Info Title: uncheck {\"sql.showConnectionInfoInTitle\": false} Data | Sql | Include Headers: check {\"sql.copyIncludeHeaders\": false}","code":"{\n  \"workbench.enablePreviewFeatures\": true,\n  \"workbench.colorTheme\": \"Default Dark Azure Data Studio\",\n  \"editor.tabSize\": 2,\n  \"editor.detectIndentation\": false,\n  \"files.insertFinalNewline\": true,\n  \"files.trimFinalNewlines\": true,\n  \"files.trimTrailingWhitespace\": true,\n  \"queryEditor.showConnectionInfoInTitle\": false,\n  \"queryEditor.results.copyIncludeHeaders\": false\n}"},{"path":"workstation.html","id":"workstation-vscode","chapter":"19 Workstation","heading":"19.2.5 Visual Studio Code","text":"Visual Studio Code extensible text editor runs Windows Linux. ’s much lighter full Visual Studio. Like Atom, supports browsing directory structure, replacing across files, interaction git, previewing markdown. VS Code good documentation Basic Editing. Productivity VS Code enhanced following extensions: {added Dec 2018} Excel Viewer isn’t good name, ’ve liked capability. displays CSVs files grid. {added Dec 2018}
Rainbow CSV color codes columns, still allows see edit raw plain-text file. {added Dec 2018} SQL Server allows execute database, view/copy/save grid results. doesn’t replicate SSMS features, nice scanning files. {added Dec 2018} Code Spell Checker produces green squiggly lines words dictionary. can add words user dictionary, project dictionary. Markdown One useful markdown capabilities, converting file html. Markdown PDF useful markdown capabilities, converting file pdf. markdownlint linting style checking. extensions can installed command line. Note: non-default changes facilitate workflow. 
Either copy configuration settings.json, manually specify options settings editor.Settings | Extensions |Markdown One | Ordered List | Auto Renumber: false {\"markdown.extension.orderedList.autoRenumber\": false}Settings | Extensions |Markdown One | Ordered List | Marker: one {\"markdown.extension.orderedList.marker\": \"one\"}","code":"code --list-extensions\ncode --install-extension GrapeCity.gc-excelviewer\ncode --install-extension mechatroner.rainbow-csv\ncode --install-extension ms-mssql.mssql\ncode --install-extension streetsidesoftware.code-spell-checker\ncode --install-extension yzhang.markdown-all-in-one\ncode --install-extension yzane.markdown-pdf\ncode --install-extension DavidAnson.vscode-markdownlint{\n  \"diffEditor.ignoreTrimWhitespace\": false,\n  \"diffEditor.maxComputationTime\": 0,\n  \"editor.acceptSuggestionOnEnter\": \"off\",\n  \"editor.renderWhitespace\": \"all\",\n  \"explorer.confirmDragAndDrop\": false,\n  \"files.associations\": {\n      \"*.Rmd\": \"markdown\"\n  },\n  \"files.trimFinalNewlines\": true,\n  \"files.trimTrailingWhitespace\": true,\n  \"git.autofetch\": true,\n  \"git.confirmSync\": false,\n  \"window.zoomLevel\": 2,\n\n  \"markdown.extension.orderedList.autoRenumber\": false,\n  \"markdown.extension.orderedList.marker\": \"one\",\n  \"markdownlint.config\": {\n      \"MD003\": { \"style\": \"setext_with_atx\" },\n      \"MD007\": { \"indent\": 2 },\n      \"MD022\": { \"lines_above\": 1,\n                  \"lines_below\": 1 },\n      \"MD024\": { \"siblings_only\": true },\n      \"no-bare-urls\": false,\n      \"no-inline-html\": {\n        \"allowed_elements\": [\n          \"mermaid\",\n          \"a\",\n          \"br\",\n          \"details\",\n          \"img\"\n        ]\n      }\n  }\n}"},{"path":"workstation.html","id":"workstation-optional","chapter":"19 Workstation","heading":"19.3 Optional Installation","text":"installation order 
matter.","code":""},{"path":"workstation.html","id":"workstation-git","chapter":"19 Workstation","heading":"19.3.1 Git","text":"Git command-line utility enables advanced operations GitHub client doesn’t support. Use default installation options, except preferences :\n1. Nano default text editor.","code":""},{"path":"workstation.html","id":"workstation-calc","chapter":"19 Workstation","heading":"19.3.2 LibreOffice Calc","text":"LibreOffice Calc alternative Excel. Unlike Excel, doesn’t guess much formatting (usually mess things, especially dates).","code":""},{"path":"workstation.html","id":"workstation-pandoc","chapter":"19 Workstation","heading":"19.3.3 pandoc","text":"pandoc converts files one markup format another. {added Sept 2012}","code":""},{"path":"workstation.html","id":"workstation-python","chapter":"19 Workstation","heading":"19.3.4 Python","text":"Python used analysts. prototypical installation involves two options.\nAnaconda, include Jupyter Notebooks, Jupyter Lab, Spyder. Plus two programs already list: RStudio VS Code. Windows, open “Anaconda Prompt” administrative privileges\nconda install numpy pandas scikit-learn matplotlib\nStandard Python, installing packages pip3 terminal. 
pip3 command unrecognized ’s missing OS path variable, alternative py -3 -mpip install paramiko; calls pip py command sometimes path variable installation.\nusing Windows .msi installer, recommended options are\nCheck “Add Python 3.10 PATH”\nCheck “Install launcher users (recommended)”\nClick “Customize Installation”\nOptional Features\nCheck “Documentation”\nCheck “pip”\n“users (requires elevation)”\n\nAdvanced Options\nCheck “Install users” (set install path something like C:\\Program Files\\Python310.)\nCheck “Add Python environment variables”\nCheck “Precompile standard library”\n\nmsi completes:\nAdd entry like C:\\Users\\USERNAME\\AppData\\Roaming\\Python\\Python310 C:\\Users\\USERNAME\\AppData\\Local\\Programs\\Python\\Python310 System Variables scripts personal AppData directory (even clicked “Install users”). helps RStudio/reticulate run python scripts.\nInstall Python packages PowerShell command line (Python)\npy -3 -mpip install biopython matplotlib numpy pandas paramiko pyarrow pyodbc pyyaml scikit-learn scipy sqlalchemy strictyaml\n\nUpdating Packages Python packages don’t need updated frequently R packages, ’s still good every months.\nPaste single line PowerShell Windows. (Stack Overflow solution Sébastien Wieckowski)\npip list -o --format json | ConvertFrom-Json | foreach {pip install $_.name -U --no-warn-script-location}\nPaste single line Bash terminal Linux. (ActiveState.com post.)\npip3 list --outdated --format=freeze | grep -v '^\\-e' | cut -d = -f 1 | xargs -n1 pip3 install -U","code":"conda install numpy pandas scikit-learn matplotlib\npy -3 -mpip install biopython matplotlib numpy pandas paramiko pyarrow pyodbc pyyaml scikit-learn scipy sqlalchemy strictyaml\npip list -o --format json | ConvertFrom-Json | foreach {pip install $_.name -U --no-warn-script-location}\npip3 list --outdated --format=freeze | grep -v '^\\-e' | cut -d = -f 1 | xargs -n1 pip3 install -U "},{"path":"workstation.html","id":"workstation-pilot-edit","chapter":"19 Workstation","heading":"19.3.5 PilotEdit","text":"PilotEdit can load huge text files fit RAM, files 100MB choke Excel, Calc, Notepad++, Visual Studio Code. Like Notepad++ VS Code, PilotEdit good Find features can (a) present search hits within file, (b) scan multiple files, (c) use regular expressions. helps trace origin problems pipeline. 
example, data warehouse suspicious character patient 10009’s BMI value, regex \\b10009\\tbmi\\b locates origin among multiple 1+GB files received. PilotEdit also good tool occasional data extract encoding problem. can side-by-side inspect hex code (visible non-visible) character produced (example ascii, “76” produces “v” “0A” produces line feed). {Added Sept 2020}","code":""},{"path":"workstation.html","id":"workstation-assets","chapter":"19 Workstation","heading":"19.4 Asset Locations","text":"GitHub repository https://github.com/OuhscBbmc/RedcapExamplesAndPatterns {added Sept 2012} File server directory Ask PI. Peds, ’s typically “S” drive. SQL Server Database Ask Thomas, David. REDCap database Ask Thomas, David. http url, ’re trying publicize value. ODBC UserDsn name depends specific repository, SQL Server database. Ask Thomas, David set .","code":""},{"path":"workstation.html","id":"workstation-administrator","chapter":"19 Workstation","heading":"19.5 Administrator Installation","text":"programs useful people administrating servers, typical data scientist.","code":""},{"path":"workstation.html","id":"workstation-mysql","chapter":"19 Workstation","heading":"19.5.1 MySQL Workbench","text":"MySQL Workbench useful occasionally REDCap admins.","code":""},{"path":"workstation.html","id":"workstation-postman","chapter":"19 Workstation","heading":"19.5.2 Postman","text":"Postman Native App useful developing API replaced Chrome app. ’s possible, web client available well. 
either program, access PHI.","code":""},{"path":"workstation.html","id":"workstation-ssms","chapter":"19 Workstation","heading":"19.5.3 SQL Server Management Studio (SSMS)","text":"SQL Server Management Studio replaced Azure Data Studio roles, still recommended database administrators. easy way access database write queries (transfer SQL R file). ’s required REDCap API, ’s usually necessary integrating REDCap databases.Note: non-default changes facilitate workflow. first two help save database structure (data) GitHub, can easily track/monitor structural changes time. tabs options keeps things consistent editors. SSMS ‘Tools | Options’ dialog box:SQL Server Object Explorer | Scripting | Include descriptive headers: FalseSQL Server Object Explorer | Scripting | Script extended properties: FalseText Editor | Languages | Tabs | Tab size: 2Text Editor | Languages | Tabs | Indent size: 2Text Editor | Languages | Tabs | Insert Spaces: trueThese don’t affect saved files, make life easier. first makes result font bigger.Environment | Fonts Colors | Show settings : Grid Results | Size: 10Query Results | SQL Server | Results Grid | Include column headers copying saving results: false`Designers | Table Database Designers | Prevent saving changes require table-recreation: falseText Editor | Editor Tab Status Bar | Tab Text | Include Server Name: falseText Editor | Editor Tab Status Bar | Tab Text | Include Database Name: falseText Editor | Editor Tab Status Bar | Tab Text | Include Login Name: falseText Editor | Languages | General | Line Numbers: trueA dark theme unofficially supported SSMS 18. write privileges “Program Files” directory, quick modification config file reduce eye strain. 
change also prevents screen flashing dark--light--dark, broadcasts wandering attention Zoom meeting. details, see setting--dev-machine.md (private repo ’s restricted BBMC members).","code":""},{"path":"workstation.html","id":"workstation-winscp","chapter":"19 Workstation","heading":"19.5.4 WinSCP","text":"WinSCP GUI SCP SFTP file transfer using SSH keys. tool occasionally useful admins collaborating institutions OU computing resources. PHI can accidentally sent collaborators without DUA, recommend WinSCP installed informed administrators. typical data scientist teams need tool. alternative FileZilla. works multiple OSes, currently doesn’t support scp (sftp).","code":""},{"path":"workstation.html","id":"workstation-troubleshooting","chapter":"19 Workstation","heading":"19.6 Installation Troubleshooting","text":"Git: Beasley resorted workaround Sept 2012: http://stackoverflow.com/questions/3431361/git--windows--program-cant-start--libiconv2-dll--missing. copied following four files D:/Program Files/msysgit/mingw/bin/ D:/Program Files/msysgit/bin/: (1) libiconv2.dll, (2) libcurl-4.dll, (3) libcrypto.dll, (4) libssl.dll. (install default location, ’ll move instead C:/msysgit/mingw/bin/ C:/msysgit/bin/) {added Sept 2012} Git: different computer, Beasley couldn’t get RStudio recognize msysGit, installed Full installer official Git Windows 1.7.11 (http://code.google.com/p/msysgit/downloads/list) switched Git Path RStudio Options. 
{added Sept 2012}\nRStudio\nIf something goes wrong RStudio, re-installing might fix issue, personal preferences aren’t erased. safe, can thorough delete equivalent C:\\Users\\wibeasley\\AppData\\Local\\RStudio\\. options settings stored (can manipulated) extensionless text file: C:\\Users\\wibeasley\\AppData\\Local\\RStudio\\monitored\\user-settings\\user-settings. See RStudio’s support page, Resetting RStudio Desktop’s State. {added Sept 2012}\nHold ctrl button clicking RStudio Windows Start Menu. Try switching 64/32-bit option. VDI, forcing software-rendering option fixed problem RStudio window opened, nothing visible inside. 
{added Jan 2022}\nmight help look logs, stored equivalent C:\\Users\\wibeasley\\AppData\\Local\\RStudio\\logs {added Jan 2022}\nQuarto\nIf (rendering document) encounter error like compilation failed- matching packages ...LaTeX Error: File 'scrreprt.cls' found., ’ll need replace installation tinytex.\nFirst uninstall & remove via R.\n\ntinytex::uninstall_tinytex()\nremove.packages(\"tinytex\")\nreinstall via command line PowerShell.\nquarto tools install tinytex","code":"\ntinytex::uninstall_tinytex()\nremove.packages(\"tinytex\")\nquarto tools install tinytex"},{"path":"workstation.html","id":"workstation-windows","chapter":"19 Workstation","heading":"19.7 Windows Installation","text":"","code":""},{"path":"workstation.html","id":"workstation-windows-explorer","chapter":"19 Workstation","heading":"19.7.1 File Explorer","text":"reviewing repo files, ’s frequently important see file extensions hidden files File Explorer. View Menu: check box “File name extensions” View Menu: check box “Hidden items”","code":""},{"path":"workstation.html","id":"workstation-ubuntu","chapter":"19 Workstation","heading":"19.8 Ubuntu Installation","text":"","code":""},{"path":"workstation.html","id":"workstation-ubuntu-r","chapter":"19 Workstation","heading":"19.8.1 R","text":"Check https://cran.r-project.org/bin/linux/ubuntu/ recent instructions.","code":"  ### Add the key, update the list, then install base R.\n  sudo apt update -qq\n  
sudo apt install --no-install-recommends software-properties-common dirmngr\n  wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc\n  sudo add-apt-repository \"deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/\"\n  sudo apt-get install r-base r-base-dev"},{"path":"workstation.html","id":"workstation-ubuntu-rstudio","chapter":"19 Workstation","heading":"19.8.2 RStudio","text":"Download recent version https://www.rstudio.com/products/rstudio/download/#download. run two gdebi() lines.\nAlternatively, update wget line recent version.","code":"  # wget https://download1.rstudio.org/desktop/bionic/amd64/rstudio-1.4.1717-amd64.deb\n  sudo apt-get install gdebi-core\n  sudo gdebi rstudio-*-amd64.deb"},{"path":"workstation.html","id":"workstation-ubuntu-packages","chapter":"19 Workstation","heading":"19.8.3 apt-get Packages","text":"next block can copied pasted (ctrl-shift-v) console entirely. 
lines can pasted individual (without ( function install-packages { line, last three lines).","code":"( function install-packages {\n\n  ### Git\n  sudo apt-get install git-core\n  git config --global user.email \"wibeasley@hotmail.com\"\n  git config --global user.name \"Will Beasley\"\n  git config --global credential.helper 'cache --timeout=3600000'\n\n  ### Ubuntu & Bioconductor packages that are indirectly needed for packages and BBMC scripts\n\n  # Supports the `locate` command in bash\n  sudo apt-get install mlocate\n\n  # The genefilter package is needed for 'modeest' on CRAN.\n  # No longer a modeest dependency: Rscript -e 'BiocManager::install(\"genefilter\")'\n\n  ### CRAN packages that are also on the Ubuntu repositories\n\n  # The 'xml2' package; https://CRAN.R-project.org/package=xml2\n  sudo apt-get --yes install libxml2-dev r-cran-xml\n\n  # The 'curl' package, and others; https://CRAN.R-project.org/package=curl\n  sudo apt-get --yes install libssl-dev libcurl4-openssl-dev\n\n  # The 'udunits2' package: https://cran.r-project.org/web/packages/udunits2/index.html\n  sudo apt-get --yes install libudunits2-dev\n\n  # The 'odbc' package: https://github.com/r-dbi/odbc#linux---debian--ubuntu\n  sudo apt-get --yes install unixodbc-dev tdsodbc odbc-postgresql libsqliteodbc\n\n  # The 'rgl' package; https://stackoverflow.com/a/39952771/1082435\n  sudo apt-get --yes install libcgal-dev libglu1-mesa-dev\n\n  # The 'gsl' package; https://cran.rstudio.com/web/packages/gsl/INSTALL\n  sudo apt-get --yes install libgsl0-dev\n\n  # The 'magick' package; https://docs.ropensci.org/magick/articles/intro.html#build-from-source\n  sudo apt-get --yes install 'libmagick++-dev'\n\n  # To compress vignettes when building a package; https://kalimu.github.io/post/checklist-for-r-package-submission-to-cran/\n  sudo apt-get --yes install qpdf\n\n  # The 'pdftools' and 'Rpoppler' packages, which involve PDFs\n  sudo apt-get --yes install libpoppler-cpp-dev libpoppler-glib-dev\n\n  
# The 'sys' package\n  sudo apt-get --yes install libapparmor-dev\n\n  # The 'archive' package; https://CRAN.R-project.org/package=archive\n  sudo apt-get --yes install libarchive-dev\n\n  # The 'sf' and other spatial packages: https://github.com/r-spatial/sf#ubuntu; https://github.com/r-spatial/sf/pull/1208\n  sudo apt-get --yes install libudunits2-dev libgdal-dev libgeos-dev libproj-dev libgeos++-dev\n\n  # For Cairo package, a dependency of Shiny & plotly; https://gykovacsblog.wordpress.com/2017/05/15/installing-cairo-for-r-on-ubuntu-17-04/\n  sudo apt-get --yes install libcairo2-dev\n\n  # 'rJava' and others; https://www.r-bloggers.com/installing-rjava-on-ubuntu/\n  sudo apt-get --yes install default-jre default-jdk\n  sudo R CMD javareconf\n  sudo apt-get --yes install r-cran-rjava\n\n  # For reprex and sometimes ssh keys; https://github.com/tidyverse/reprex#installation\n  sudo apt-get --yes install xclip\n\n  # gifski -apparently the rust compiler is necessary\n  sudo apt-get --yes install cargo\n\n  # For databases\n  sudo apt-get --yes install sqlite sqliteman\n  sudo apt-get --yes install postgresql postgresql-contrib pgadmin3\n\n  # pandoc\n  sudo apt-get --yes install pandoc\n\n  # For checking packages. Avoid `/usr/bin/texi2dvi: not found` warning.\n  sudo apt-get install texinfo\n}\ninstall-packages\n)"},{"path":"workstation.html","id":"workstation-ubuntu-pandoc","chapter":"19 Workstation","heading":"19.8.4 Pandoc","text":"version pandoc Ubuntu repository may delayed. install latest version, download .deb file install directory. 
Finally, verify version.","code":"sudo dpkg -i pandoc-*\npandoc -v"},{"path":"workstation.html","id":"workstation-ubuntu-postman","chapter":"19 Workstation","heading":"19.8.5 Postman","text":"Postman native app Ubuntu installed snap, updated daily automatically.","code":"snap install postman"},{"path":"workstation.html","id":"workstation-retired","chapter":"19 Workstation","heading":"19.9 Retired Tools","text":"previously installed software . replaced software ’s either newer natural use. GitLab SSL Certificate isn’t software, still needs configured.\nTalk server URL *.cer file.\nSave file something like ~/keys/ca-bundle-gitlab.cer\nAssociate file git config --global http.sslCAInfo ...path.../ca-bundle-gitlab.cer (replace ...path...).\nMiKTeX necessary ’re using knitr Sweave produce LaTeX files (just markdown files). ’s huge, slow installation can take hour two. {added Sept 2012} Pulse Secure VPN client OUHSC researchers. ’s required REDCap API, ’s usually necessary communicate campus data sources. msysGit allows RStudio track changes commit & sync GitHub server. Connect RStudio GitHub repository. moved optional (Oct 14, 2012) GitHub client (see ) almost everything RStudio plugin ; little better little robust; installation hasn’t given problems. {added Oct 2012}\nStarting top right RStudio, click: Project -> New Project -> Create Project Version Control -> Git {added Sept 2012}\nexample repository URL https://github.com/OuhscBbmc/RedcapExamplesAndPatterns. 
Specify location save (copy ) project local computer. {added Sept 2012}\nCSVed lightweight program viewing data files. fits somewhere text editor Excel. SourceTree rich client many features GitHub client. don’t recommend beginners, since ways mess things. developers, nicely fills spot GitHub client command-line operations. branching visualization really nice . Unfortunately ironically, doesn’t currently support Linux. {added Sept 2014}. git-cola probably best GUI Git supported Linux. ’s available official Ubuntu repositories apt-get (also see ). branch visualization features different, related program, ‘git dag’. {added Sept 2014} GitHub Eclipse something discourage beginner, strongly recommend start RStudio (GitHub Client git capabilities within RStudio) months even consider Eclipse. ’s included list sake completeness. 
installing EGit plug-, ignore eclipse site check youtube video:http://www.youtube.com/watch?v=I7fbCE5nWPU. Color Oracle simulates three common types color blindness. produce color graph report develop, check Color Oracle (ask someone else ). ’s already installed, takes less 10 second check three types color blindness. ’s installed, extra work may necessary Java isn’t already installed. download zip, extract ColorOracle.exe program like. {added Sept 2012} Atom text editor, similar Notepad++. Notepad++ appears efficient opening large CSVs. Atom better suited editing lot files repository. finding replacing across lot files, superior Notepad++ RStudio; permits regexes great GUI preview potential replacements.\nProductivity enhanced following Atom packages:\nSublime Style Column Selection: Enable Sublime style ‘Column Selection’. Just hold ‘alt’ select, select using middle mouse button.\natom-language-r allows Atom recognize files R. prevents spell checking indicators enable syntax highlighting. need browse lot scattered R files quickly, Atom’s tree panel (left) works well. older alternative language-r.\nlanguage-csv: Adds syntax highlighting CSV files. 
highlighting nice, automatically disables spell checking lines.\natom-beautify: Beautify HTML, CSS, JavaScript, PHP, Python, Ruby, Java, C, C++, C#, Objective-C, CoffeeScript, TypeScript, Coldfusion, SQL, Atom.\natom-wrap-in-tag: wraps tag around selection; just select word phrase hit Alt + Shift + w.\nminimap: preview full source code (right margin).\nscript: Run scripts based file name, selection code, line number.\ngit-plus: git things without terminal (don’t think necessary anymore).\npackages can installed Atom, apm utility command line:\napm install sublime-style-column-selection atom-language-r language-csv atom-beautify atom-wrap-in-tag minimap script\nfollowing settings keep files consistent among developers.\nFile | Settings | Editor | Tab Length: 2 (opposed 3 4, used conventions)\nFile | Settings | Editor | Tab Type: soft (inserts 2 spaces instead tab ‘Tab’ pressed)","code":"apm install sublime-style-column-selection atom-language-r language-csv atom-beautify atom-wrap-in-tag minimap script"},{"path":"tools.html","id":"tools","chapter":"20 Considerations when Selecting Tools","heading":"20 Considerations when Selecting Tools","text":"","code":""},{"path":"tools.html","id":"general","chapter":"20 Considerations when Selecting Tools","heading":"20.1 General","text":"","code":""},{"path":"tools.html","id":"the-components-goal","chapter":"20 Considerations when Selecting Tools","heading":"20.1.1 The Component’s Goal","text":"discussing advantages disadvantages tools, colleague said, “Tidyverse packages don’t anything can’t already Base R, sometimes even requires lines code”. Regardless agree, feel two points irrelevant. Sometimes advantage tool isn’t expand existing capabilities, rather facilitate development maintenance capability. Likewise, care less line count, readability. ’d prefer maintain 20-line chunk familiar readable 10-line chunk dense phrases unfamiliar functions. 
bottleneck projects human time, execution time.","code":""},{"path":"tools.html","id":"current-skill-set-of-team","chapter":"20 Considerations when Selecting Tools","heading":"20.1.2 Current Skill Set of Team","text":"","code":""},{"path":"tools.html","id":"desired-future-skill-set-of-team","chapter":"20 Considerations when Selecting Tools","heading":"20.1.3 Desired Future Skill Set of Team","text":"","code":""},{"path":"tools.html","id":"skill-set-of-audience","chapter":"20 Considerations when Selecting Tools","heading":"20.1.4 Skill Set of Audience","text":"","code":""},{"path":"tools.html","id":"languages","chapter":"20 Considerations when Selecting Tools","heading":"20.2 Languages","text":"","code":""},{"path":"tools.html","id":"r-packages","chapter":"20 Considerations when Selecting Tools","heading":"20.3 R Packages","text":"developing codebase used many people, choose packages functionality, well ease installation maintainability. example, rJava package powerful package allows R package developers leverage widespread Java framework many popular Java packages. However, installing Java setting appropriate path registry settings can error-prone, especially non-developers.\nTherefore considering two functions comparable capabilities (e.g., xlsx::read.xlsx() readxl::read_excel()), avoid package requires proper installation configuration Java rJava.\nintensive choice required (say, need capability xlsx missing readxl), take:\n20 minutes start markdown file enumerates package’s direct indirect dependencies require manual configuration (e.g., rJava Java), download , typical installation steps.\n5 minutes create GitHub Issue () announces new requirement, (b) describes /needs install requirement, (c) points markdown documentation, (d) encourages teammates post problems, recommendations, solutions issue. ’ve found dedicated Issue helps communicate package dependency necessitates intention encourages people assist people’s troubleshooting. 
something potentially useful posted Issue, move markdown document. Make sure document issue hyperlink .\n15 minutes every year re-evaluate landscape. Confirm package still actively maintained, newer (easily- maintained) package offers desired capability.12 better fit now exists, evaluate effort transition new package worth benefit. willing transition project relatively green, development upcoming. willing transition transition relatively -place, require much modification code training people.\ndeveloping codebase used many people, choose packages functionality, well ease installation maintainability. example, rJava package powerful package allows R package developers leverage widespread Java framework many popular Java packages. However, installing Java setting appropriate path registry settings can error-prone, especially non-developers.Therefore considering two functions comparable capabilities (e.g., xlsx::read.xlsx() readxl::read_excel()), avoid package requires proper installation configuration Java rJava.intensive choice required (say, need capability xlsx missing readxl), take:20 minutes start markdown file enumerates package’s direct indirect dependencies require manual configuration (e.g., rJava Java), download , typical installation steps.20 minutes start markdown file enumerates package’s direct indirect dependencies require manual configuration (e.g., rJava Java), download , typical installation steps.5 minutes create GitHub Issue () announces new requirement, (b) describes /needs install requirement, (c) points markdown documentation, (d) encourages teammates post problems, recommendations, solutions issue. ’ve found dedicated Issue helps communicate package dependency necessitates intention encourages people assist people’s troubleshooting. something potentially useful posted Issue, move markdown document. 
Make sure document issue hyperlink .5 minutes create GitHub Issue () announces new requirement, (b) describes /needs install requirement, (c) points markdown documentation, (d) encourages teammates post problems, recommendations, solutions issue. ’ve found dedicated Issue helps communicate package dependency necessitates intention encourages people assist people’s troubleshooting. something potentially useful posted Issue, move markdown document. Make sure document issue hyperlink .15 minutes every year re-evaluate landscape. Confirm package still actively maintained, newer (easily- maintained) package offers desired capability.12 better fit now exists, evaluate effort transition new package worth benefit. willing transition project relatively green, development upcoming. willing transition transition relatively -place, require much modification code training people.15 minutes every year re-evaluate landscape. Confirm package still actively maintained, newer (easily- maintained) package offers desired capability.12 better fit now exists, evaluate effort transition new package worth benefit. willing transition project relatively green, development upcoming. willing transition transition relatively -place, require much modification code training people.Finally, consider much traffic passes dependency brittle dependency disruptive isolated downstream analysis file run one statistician. 
hand, protective middle pipeline typically team runs.","code":""},{"path":"tools.html","id":"database","chapter":"20 Considerations when Selecting Tools","heading":"20.4 Database","text":"Ease installation & maintenanceEase installation & maintenanceSupport –database engine comfortable supporting.Support –database engine comfortable supporting.Integration LDAP, Active Directory, Shibboleth.Integration LDAP, Active Directory, Shibboleth.Warehouse vs transactional performanceWarehouse vs transactional performance","code":""},{"path":"tools.html","id":"additional-resources-2","chapter":"20 Considerations when Selecting Tools","heading":"20.5 Additional Resources","text":"(Colin Gillespie 2017), particularly “Package selection” section.","code":""},{"path":"team.html","id":"team","chapter":"21 Growing a Team","heading":"21 Growing a Team","text":"","code":""},{"path":"team.html","id":"recruiting","chapter":"21 Growing a Team","heading":"21.1 Recruiting","text":"","code":""},{"path":"team.html","id":"training-to-data-science","chapter":"21 Growing a Team","heading":"21.2 Training to Data Science","text":"Starting ResearcherStarting StatisticianStarting DBAStarting Software Developer","code":""},{"path":"team.html","id":"bridges-outside-the-team","chapter":"21 Growing a Team","heading":"21.3 Bridges Outside the Team","text":"Monthly User GroupsAnnual Conferences","code":""},{"path":"redcap-user.html","id":"redcap-user","chapter":"22 Material for REDCap Users","heading":"22 Material for REDCap Users","text":"","code":""},{"path":"redcap-user.html","id":"redcap-user-login","chapter":"22 Material for REDCap Users","heading":"22.1 Login","text":"","code":""},{"path":"redcap-user.html","id":"redcap-user-report-develop","chapter":"22 Material for REDCap Users","heading":"22.2 Developing Reports","text":"Please first read Login","code":""},{"path":"redcap-developer.html","id":"redcap-developer","chapter":"23 Material for REDCap Developers","heading":"23 Material for REDCap 
Developers","text":"","code":""},{"path":"redcap-admin.html","id":"redcap-admin","chapter":"24 Material for REDCap Admins","heading":"24 Material for REDCap Admins","text":"","code":""},{"path":"git.html","id":"git","chapter":"A Git & GitHub","heading":"A Git & GitHub","text":"","code":""},{"path":"git.html","id":"git-justification","chapter":"A Git & GitHub","heading":"A.1 Justification","text":"(Written 2017 justify service corporation’s department.)Git GitHub de facto version control software hosting solution software development modern data science. Using GitHub help group three critical tasks: () developing software, (b) leveraging innovations others, (c) attracting top talent.Developing Software: Version control critical developing quality software, especially multiple data scientists contributing code bank. Among modern version control software, Git GitHub popular new projects, especially among talent pool recruit . Compared outdated approaches using conventional file-servers, version control substantially increases productivity. Analysts can develop code & report parallel, combine branch mature. Additionally, commits saved indefinitely, allowing us ‘turn back clock’ resurrect older code necessary. also allows us organize manage proprietary code single (distributed) location.Given needs small data science team, believe private GitHub repositories (secured two-factor authentication) strike nice balance () security, (b) ease use developers, (c) ease maintenance administrators, (d) cost.Leveraging Innovation: cutting-edge data science algorithms released GitHub. algorithms stand-alone software; instead augment statistical software, R, approved . Furthermore, GitHub.com hosts documentation user forums data science algorithms. Without access information, greater risk misunderstanding misusing routines, weaken accuracy financial reports produce.Attracting Talent: compete top talent highly competitive field data science, want provide access standard tools. 
want send message organization doesn’t value advancements appreciated employed competitors.Alternatives: GitHub approach described common, approach endorsed contemporary developers. Others include:GitHub Enterprise: hosting solution developed GitHub, hosted university-controlled VM.GitLab: competitor GitHub. GitLab uses Git, different hosting options, cloud -premises.Mercurial: modern version control similar Git. many Git’s strengths avoids many undesirable features Subversion/SVN.Atlassian: competitor GitHub focuses businesses. Atlassian/Bitbucket repositories can use Git Mercurial. Like GitHub GitLab, offers different hosting options.Resources:GitHub BusinessGit Teams","code":""},{"path":"git.html","id":"git-code","chapter":"A Git & GitHub","heading":"A.2 for Code Development","text":"Jenny Bryan Jim Hester published thorough description using Git data scientist’s perspective (Happy Git GitHub useR), recommend following guidance. consistent approach, exceptions noted . complementary resource Team Geek, insightful advice human collaborative aspects version control.ResourcesSetting CI/CD Process GitHub Travis CI. Travis-CI blog August 2019.","code":""},{"path":"git.html","id":"git-collaboration","chapter":"A Git & GitHub","heading":"A.3 for Collaboration","text":"Somewhat separate ’s version control capabilities, GitHub provides built-tools coordinating projects across people time. 
tools revolves around GitHub Issues, allow teammates totrack issues assigned otherstrack issues assigned otherssearch teammates encountered similar problems facing now (e.g., new computer can’t install rJava package).search teammates encountered similar problems facing now (e.g., new computer can’t install rJava package).’s nothing magical GitHub issues, don’t use , consider using similar capable tools like offered Atlassian, Asana, Basecamp, many others.tips experiences projects involving 2 10 statisticians working upcoming deadline.create error describes problem blocking progress, include raw text (e.g., error: JAVA_HOME determined Registry) possibly screenshot. text allows problem easily searched people later; screenshot usually provides extra context allows understand situation help quickly.create error describes problem blocking progress, include raw text (e.g., error: JAVA_HOME determined Registry) possibly screenshot. text allows problem easily searched people later; screenshot usually provides extra context allows understand situation help quickly.Include enough broad context enough specific details teammates can quickly understand problem. Ideally can even run code debug . Good recommendations can found Stack Overflow posts, ‘make great R reproducible example’ ‘ask good question?’. issues don’t need thorough, teammates start context Stack Overflow reader.\ntypically include\ndescription problem fishy behavior.\nexact error message (good description fishy behavior).\nsnippet 1-10 lines code suspected causing problem.\nlink code’s file (ideally line number, https://github.com/OuhscBbmc/REDCapR/blob/main/R/redcap-version.R#L40) reader can hop entire file.\nreferences similar GitHub Issues Stack Overflow questions aid troubleshooting.\nInclude enough broad context enough specific details teammates can quickly understand problem. Ideally can even run code debug . 
Good recommendations can found Stack Overflow posts, ‘make great R reproducible example’ ‘ask good question?’. issues don’t need thorough, teammates start context Stack Overflow reader.typically includea description problem fishy behavior.description problem fishy behavior.exact error message (good description fishy behavior).exact error message (good description fishy behavior).snippet 1-10 lines code suspected causing problem.snippet 1-10 lines code suspected causing problem.link code’s file (ideally line number, https://github.com/OuhscBbmc/REDCapR/blob/main/R/redcap-version.R#L40) reader can hop entire file.link code’s file (ideally line number, https://github.com/OuhscBbmc/REDCapR/blob/main/R/redcap-version.R#L40) reader can hop entire file.references similar GitHub Issues Stack Overflow questions aid troubleshooting.references similar GitHub Issues Stack Overflow questions aid troubleshooting.","code":""},{"path":"git.html","id":"git-stability","chapter":"A Git & GitHub","heading":"A.4 for Stability","text":"Review Git commits closely\nunintended functional difference (e.g., !match accidentally changed match).\nPHI snuck (e.g., patient ID used isolating debugging).\nmetadata format didn’t change (e.g., Excel sometimes changes string ‘010’ number ‘10’). See appendix longer discussion problems Excel typically introduces.\nReview Git commits closelyNo unintended functional difference (e.g., !match accidentally changed match).PHI snuck (e.g., patient ID used isolating debugging).metadata format didn’t change (e.g., Excel sometimes changes string ‘010’ number ‘10’). See appendix longer discussion problems Excel typically introduces.","code":""},{"path":"git.html","id":"organization-wide-defaults-and-practices","chapter":"A Git & GitHub","heading":"A.5 Organization-wide defaults and practices","text":"core-wide goal secure default applies GitHub . 
security measures added explicitly (e.g., .gitignore blocking common data files like *.csv & *.xlsx), organization-wide settings make new repo secure soon initialized, even cost accessibility.DefaultsTwo-factor authentication required organization members outside collaborators. See setting “Security” => “Two-factor authentication”Two-factor authentication required organization members outside collaborators. See setting “Security” => “Two-factor authentication”Organization members restricted creating repositories. See setting “Member privileges” => “Repository creation”.Organization members restricted creating repositories. See setting “Member privileges” => “Repository creation”.Organization members zero permissions new repositories. See setting “Member privileges” => “Default repository permission”\n.Organization members zero permissions new repositories. See setting “Member privileges” => “Default repository permission”\n.PracticesAuthorized teammates outside OUHSC designated outside collaborators, instead “members”.Authorized teammates outside OUHSC designated outside collaborators, instead “members”.three people owners GitHub organization. Everyone else must explicitly added appropriate repository. important restrictions members include () add/delete/transfer (private public) repositories (b) add/delete members organization.three people owners GitHub organization. Everyone else must explicitly added appropriate repository. 
important restrictions members include () add/delete/transfer (private public) repositories (b) add/delete members organization.Every week, owner (probably (wibeasley?)) review organization’s audit log (owners can view).Every week, owner (probably (wibeasley?)) review organization’s audit log (owners can view).Two owners must discuss agree upon adding/modifying/deleting extra entity added GitHub Organization, including\nwebhooks,\nthird-party applications,\ninstalled integration, \nOAuth applications.\nCurrently, approved entity Codecov integration, helps us test package code quantify coverage (“Improve code quality. Expose bugs security vulnerabilities.”). Codecov must explicitly turned desired repository.Two owners must discuss agree upon adding/modifying/deleting extra entity added GitHub Organization, includingwebhooks,third-party applications,installed integration, andOAuth applications.Currently, approved entity Codecov integration, helps us test package code quantify coverage (“Improve code quality. Expose bugs security vulnerabilities.”). Codecov must explicitly turned desired repository.","code":""},{"path":"git.html","id":"git-collaborators","chapter":"A Git & GitHub","heading":"A.6 for New Collaborators","text":"","code":""},{"path":"git.html","id":"git-contribution","chapter":"A Git & GitHub","heading":"A.7 Steps for Contributing to Repo","text":"","code":""},{"path":"git.html","id":"git-contribution-regular","chapter":"A Git & GitHub","heading":"A.7.1 Regular Contributions","text":"","code":""},{"path":"git.html","id":"git-contribution-regular-pull","chapter":"A Git & GitHub","heading":"A.7.1.1 Keep your dev branch fresh","text":"recommend least every day write code repo. 
Perhaps frequently lot developers pushing code (e.g., right reporting deadline).Update “main” branch local machine (GitHub server)Merge main local dev branchPush local dev branch GitHub server","code":""},{"path":"git.html","id":"git-contribution-regular-push","chapter":"A Git & GitHub","heading":"A.7.1.2 Make your code contributions available to other analysts","text":"least every days, push changes main branch teammates can benefit work. Especially improving pipeline code (e.g. Ellises REDCap Arches)Make sure dev branch updated immediately create Pull Request. Follow steps .Verify merged code still works expected. words, make sure new code blended newest main code, nothing breaks. Depending repo, steps might include\nBuild Check repo (assuming repo also package).\nRun code verifies basic functionality repo. (example, MIECHV team run “high-school-funnel.R” verify assertions passed).\nCommit changes dev branch push GitHub server.Create Pull Request (otherwise known PR) assign reviewer. (example, developers MIECHV team paired together review ’s code.)reviewer pull dev branch local machine run checks verification (2nd step ). duplicate effort helps verify code likely works everyone machines.reviewer accepts PR main branch now contains changes available teammates.","code":""},{"path":"git.html","id":"main-vs-master-branch","chapter":"A Git & GitHub","heading":"A.7.1.3 “Main” vs “Master” Branch","text":"using old repo (initialized 2021) whose default branch still called “master”, ’s fairly simple rename “main” server.client, two options. first delete reclone (make sure everything pushed central repo deleting). 
second open command prompt (Window’s cmd, Window’s PowerShell, Linux bash) paste four lines.","code":"git branch -m master main\ngit fetch origin\ngit branch -u origin/main main\ngit remote set-head origin -a"},{"path":"git.html","id":"repo-style","chapter":"A Git & GitHub","heading":"A.8 Repo Style","text":"Please see Code Repositories section Style Guide chapter.{Transfer & update material https://github.com/OuhscBbmc/BbmcResources/blob/main/instructions/github.md}","code":""},{"path":"regex.html","id":"regex","chapter":"B Regular Expressions","heading":"B Regular Expressions","text":"“regular expression” (commonly called “regex”) allows programmer leverage pattern identifies (possibly extracts) nuggets information buried within text fields otherwise unparsable. can’t comfortable regexes data sciencing. learn new regex capabilities, ’ll see opportunities extract information efficiency integrity.Regexes may confusing first (may always remain little confusing) following resources help become proficient.Tools:http://regex101.com easy tool developing testing regex patterns replacements. Cool features include () panel thorough explanation every characteristic regex (b) ability save regex publicly share collaborators. supports different flavors –latest PCRE version corresponds R’s regex engine.\ntransferring regex website R, don’t forget “backslash backslashes”. words, regex pattern \\d{3} (matches three consecutive digits), declare R variable pattern <- \"\\\\d{3}\".http://regex101.com easy tool developing testing regex patterns replacements. Cool features include () panel thorough explanation every characteristic regex (b) ability save regex publicly share collaborators. supports different flavors –latest PCRE version corresponds R’s regex engine.transferring regex website R, don’t forget “backslash backslashes”. 
words, regex pattern \\d{3} (matches three consecutive digits), declare R variable pattern <- \"\\\\d{3}\".Books:Regular Expressions Chapter R Data Science, 2nd edition.Introducing Regular ExpressionsRegular Expressions Cookbook, 2nd EditionMastering Regular Expressions, 3rd editionPresentations:Regex SCUG Presentation","code":""},{"path":"snippets.html","id":"snippets","chapter":"C Snippets","heading":"C Snippets","text":"","code":""},{"path":"snippets.html","id":"snippets-reading","chapter":"C Snippets","heading":"C.1 Reading External Data","text":"","code":""},{"path":"snippets.html","id":"snippets-reading-excel","chapter":"C Snippets","heading":"C.1.1 Reading from Excel","text":"Background: Avoid Excel reasons previously discussed. isn’t another good option, protective. readxl::read_excel() allows specify column types, column order. names col_types ignored readxl::read_excel(). defend roaming columns (e.g., files changed time), testit::assert() order expect.See readxl vignette, Cell Column Types, info.Last Modified: 2019-12-12 ","code":"\n# ---- declare-globals ---------------------------------------------------------\nconfig                         <- config::get()\n\n# cat(sprintf('  `%s`             = \"text\",\\n', colnames(ds)), sep=\"\") # 'text' by default --then change where appropriate.\ncol_types <- c(\n  `Med Rec Num`     = \"text\",\n  `Admit Date`      = \"date\",\n  `Tot Cash Pymt`   = \"numeric\"\n)\n\n# ---- load-data ---------------------------------------------------------------\nds <- readxl::read_excel(\n  path      = config$path_admission_charge,\n  col_types = col_types\n  # sheet   = \"dont-use-sheets-if-possible\"\n)\n\ntestit::assert(\n  \"The order of column names must match the expected list.\",\n  names(col_types) == colnames(ds)\n)\n\n# Alternatively, this provides more detailed error messages than `testit::assert()`\n# testthat::expect_equal(\n#   colnames(ds),\n#   names(col_types),\n#   label = \"worksheet's column name (x)\",\n#   
expected.label = \"col_types' name (y)\"\n# )"},{"path":"snippets.html","id":"snippets-reading-trailing-comma","chapter":"C Snippets","heading":"C.1.2 Removing Trailing Comma from Header","text":"Background: Occasionally Meditech Extract extra comma end 1st line. subsequent line, readr::read_csv() appropriately throws new warning missing column. warning flood can mask real problems.Explanation: snippet () reads csv plain text, (b) removes final comma, (c) passes plain text readr::read_csv() convert data.frame.Instruction: Modify Dx50 Name name final (real) column.Real Example: truong-pharmacist-transition-1 (Accessible CDW members.)Last Modified: 2019-12-12 ","code":"\n# The next two lines remove the trailing comma at the end of the 1st line.\nraw_text  <- readr::read_file(path_in)\nraw_text  <- sub(\"^(.+Dx50 Name),\", \"\\\\1\", raw_text)\n\nds        <- readr::read_csv(raw_text, col_types=col_types)"},{"path":"snippets.html","id":"snippets-reading-vroom","chapter":"C Snippets","heading":"C.1.3 Reading Large Files with vroom","text":"Background: incoming data files large side comfortably accept readr, use vroom. two packages developed group might combined future.Explanation: snippet defines col_types list names mimic approach using readr. small differences readr approach:\n1. col_types list instead readr::cols_only object.\n1. call vroom::vroom() passes col_names = names(col_types) explicitly.\n1. data file contains columns don’t need, define col_types anyway; vroom needs know file structure ’s missing header row.Real Example: akande-medically-complex-1 (Accessible CDW members.) 
These files header variable names; first line file first data row.Last Modified: 2020-08-21 ","code":"\n# ---- declare-globals ---------------------------------------------------------\nconfig            <- config::get()\n\ncol_types <- list(\n  sak                      = vroom::col_integer(),  # \"system-assigned key\"\n  aid_category_id          = vroom::col_character(),\n  age                      = vroom::col_integer(),\n  service_date_first       = vroom::col_date(\"%m/%d/%Y\"),\n  service_date_last        = vroom::col_date(\"%m/%d/%Y\"),\n  claim_type               = vroom::col_character(),\n  provider_id              = vroom::col_character(),\n  provider_lat             = vroom::col_double(),\n  provider_long            = vroom::col_double(),\n  provider_zip             = vroom::col_character(),\n  cpt                      = vroom::col_integer(),\n  revenue_code             = vroom::col_integer(),\n  icd_code                 = vroom::col_character(),\n  icd_sequence             = vroom::col_integer(),\n  vocabulary_coarse_id     = vroom::col_integer()\n)\n\n# ---- load-data ---------------------------------------------------------------\nds <- vroom::vroom(\n  file      = config$path_ohca_patient,\n  delim     = \"\\t\",\n  col_names = names(col_types),\n  col_types = col_types\n)\n\nrm(col_types)"},{"path":"snippets.html","id":"snippets-grooming","chapter":"C Snippets","heading":"C.2 Grooming","text":"","code":""},{"path":"snippets.html","id":"snippets-grooming-two-year","chapter":"C Snippets","heading":"C.2.1 Correct for misinterpreted two-digit year","text":"Background: Sometimes Meditech dates specified like 1/6/54 instead 1/6/1954. readr::read_csv() choose year supposed ‘1954’ ‘2054’. human can use context guess birth date past (guesses 1954), readr can’t (guesses 2054). avoid problems, request dates ISO-8601 format.Explanation: Correct dplyr::mutate() clause; compare date value today. 
date today , use ; day future, subtract 100 years.Instruction: future dates loan payments, direction flip.Last Modified: 2019-12-12 ","code":"\nds |>\n  dplyr::mutate(\n    dob = dplyr::if_else(dob <= Sys.Date(), dob, dob - lubridate::years(100))\n  )"},{"path":"snippets.html","id":"snippets-identification","chapter":"C Snippets","heading":"C.3 Identification","text":"","code":""},{"path":"snippets.html","id":"snippets-identification-tags","chapter":"C Snippets","heading":"C.3.1 Generating “tags”","text":"Background: need generate unique identification values future people/clients/patients, described style guide.Explanation: snippet create 5-row csv random 7-character “tags” send research team collecting patients.Instruction: Set pt_count, tag_length, path_out, execute. Add rename columns appropriate domain (e.g., change “patient tag” “store tag”).Last Modified: 2019-12-30 WillThe resulting dataset look like , different randomly-generated tags.","code":"\npt_count    <- 5L   # The number of rows in the dataset.\ntag_length  <- 7L   # The number of characters in each tag.\npath_out    <- \"data-private/derived/pt-pool.csv\"\n\ndraw_tag <- function (tag_length = 4L, urn = c(0:9, letters)) {\n  paste(sample(urn, size = tag_length, replace = TRUE), collapse = \"\")\n}\n\nds_pt_pool <-\n  tibble::tibble(\n    pt_index    = seq_len(pt_count),\n    pt_tag      = vapply(rep(tag_length, pt_count), draw_tag, character(1)),\n    assigned    = FALSE,\n    name_last   = \"--\",\n    name_first  = \"--\"\n  )\n\nreadr::write_csv(ds_pt_pool, path_out)\n# A tibble: 5 x 5\n  pt_index pt_tag  assigned name_last name_first\n                  \n1        1 seikyfr FALSE    --        --\n2        2 voiix4l FALSE    --        --\n3        3 wosn4w2 FALSE    --        --\n4        4 jl0dg84 FALSE    --        --\n5        5 r5ei5ph FALSE    --        --"},{"path":"snippets.html","id":"snippets-correspondence","chapter":"C Snippets","heading":"C.4 Correspondence with 
Collaborators","text":"","code":""},{"path":"snippets.html","id":"snippets-correspondence-excel","chapter":"C Snippets","heading":"C.4.1 Excel files","text":"Receiving storing Excel files almost always avoided reasons explained letter.receive extracts Excel files frequently, following request ready email person sending us Excel files. Adapt bold values like “109.19” situation. familiar tools, suggest alternative saving file csv. presented Excel gotchas, almost everyone ‘aha’ moment recognizes problem. Unfortunately, everyone flexible software can adapt easily.[Start letter]Sorry tedious, please resend extract csv file? Please call questions.Excel helpful values, essentially corrupting . example, values like 109.19 interpreted number, character code (e.g., see cell L14). limitations finite precision, becomes 109.18999999999999773. can’t round , values column cast numbers, V55.0. Furthermore, “E”s codes incorrectly interpreted exponent operator (e.g., “4E5” converted 400,000).\nFinally, values like 001.0 converted number leading trailing zeros dropped (cells like “1” distinguishable “001.0”).Unfortunately problems exist Excel file . import columns text, values already corrupted state.Please compress/zip csv file large email. ’ve found Excel file typically 5-10 times larger compressed csv.much Excel interferes medical variables, ’re lucky. messed branches science much worse. Genomics using far late realized mistakes.happened? default, Excel popular spreadsheet applications convert gene symbols dates numbers. example, instead writing “Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase,” researchers dubbed gene MARCH1. Excel converts date—03/01/2016, say—’s probably majority spreadsheet users mean type cell. Similarly, gene identifiers like “2310009E13” converted exponential numbers (2.31E+19). 
cases, conversions strip valuable information genes question.[End letter]","code":""},{"path":"presentations.html","id":"presentations","chapter":"D Presentations","heading":"D Presentations","text":"collection presentations BBMC friends may help demonstrate concepts discussed previous chapters.","code":""},{"path":"presentations.html","id":"presentations-crdw","chapter":"D Presentations","heading":"D.1 CRDW","text":"prairie-outpost-public: Documentation starter files OUHSC’s Clinical Data Warehouse.OUHSC CDW","code":""},{"path":"presentations.html","id":"presentations-redcap","chapter":"D Presentations","heading":"D.2 REDCap","text":"Secure Medical Data Collection - Best Practices Excel, Leveling REDCap & CollaboratoR. R/Medicine 2021, Virtual. Accompanying vignette: Typical REDCap Workflow Data Analyst.Secure Medical Data Collection - Best Practices Excel, Leveling REDCap & CollaboratoR. R/Medicine 2021, Virtual. Accompanying vignette: Typical REDCap Workflow Data Analyst.REDCap Systems Integration. REDCap Con 2015, Portland, Oregon.REDCap Systems Integration. REDCap Con 2015, Portland, Oregon.Literate Programming Patterns Practices REDCap REDCap Con 2014, Park City, Utah.Literate Programming Patterns Practices REDCap REDCap Con 2014, Park City, Utah.Interacting REDCap API using REDCapR Package REDCap Con 2014, Park City, Utah.Interacting REDCap API using REDCapR Package REDCap Con 2014, Park City, Utah.Optimizing Study Management using REDCap, R, software tools. SCUG 2013.Optimizing Study Management using REDCap, R, software tools. SCUG 2013.","code":""},{"path":"presentations.html","id":"presentations-reproducible","chapter":"D Presentations","heading":"D.3 Reproducible Research & Visualization","text":"Building pipelines dashboards practitioners: Mobilizing knowledge reproducible reporting. Displaying Health Data Colloquium 2018, University Victoria.Interactive reports webpages R & Shiny. 
SCUG 2015.Big data, big analysis: collaborative framework multistudy replication. Conventional Canadian Psychological Association, Victoria BC, 2016.WATS: wrap-around time series: Code accompany WATS Plot article, 2014.","code":""},{"path":"presentations.html","id":"presentations-data-management","chapter":"D Presentations","heading":"D.4 Data Management","text":"BBMC Validator: catch communicate data errors. SCUG 2016.Text manipulation Regular Expressions, Part 1 Part 2. SCUG 2016.Time Effort Data Synthesis. SCUG 2015.","code":""},{"path":"presentations.html","id":"presentations-github","chapter":"D Presentations","heading":"D.5 GitHub","text":"Scientific Collaboration GitHub. OU Bioinformatics Breakfast Club 2015.","code":""},{"path":"presentations.html","id":"presentations-software","chapter":"D Presentations","heading":"D.6 Software","text":"REDCapR: Interaction R REDCap.OuhscMunge: Data manipulation operations commonly used Biomedical Behavioral Methodology Core within Department Pediatrics University Oklahoma Health Sciences Center.codified: Produce standard/formalized demographics tables.usnavy billets: Optimally assigning naval officers billets.","code":""},{"path":"presentations.html","id":"presentations-architecture","chapter":"D Presentations","heading":"D.7 Architectures","text":"Linear Pipeline R Analysis Skeleton\n\n.\nLinear Pipeline R Analysis Skeleton\n.\nMany--many Pipeline R Analysis Skeleton\n\n.\nMany--many Pipeline R Analysis Skeleton\n.\nImmunization transfer\n\n.\nImmunization transfer\n.\nIALSA: Collaborative Modeling Framework Multi-study Replication\n\n.\nIALSA: Collaborative Modeling Framework Multi-study Replication\n.\nPOPS: Automated daily screening eligibility rare understudied prescriptions.\n\n.\nPOPS: Automated daily screening eligibility rare understudied prescriptions.\n.\n","code":""},{"path":"presentations.html","id":"presentations-components","chapter":"D Presentations","heading":"D.8 Components","text":"Customizing display 
tables: using css DT kableExtra. SCUG 2018.yaml expandable trees selectively show subsets hierarchy, 2017.","code":""},{"path":"scratch-pad.html","id":"scratch-pad","chapter":"E Scratch Pad of Loose Ideas","heading":"E Scratch Pad of Loose Ideas","text":"","code":""},{"path":"scratch-pad.html","id":"chapters-sections-to-form","chapter":"E Scratch Pad of Loose Ideas","heading":"E.1 Chapters & Sections to Form","text":"Tools Consider\ntidyverse\nodbc\nTools Considertidyverseodbcggplot2\nuse factors explanatory variables want keep order consistent across graphs. (genevamarshall)\nggplot2use factors explanatory variables want keep order consistent across graphs. (genevamarshall)automation remote server VDI\n’s always chance machine configured little differently , may affect results. glance results ? forgot project , wouldn’t able spot problems like can. S drive file tables don’t seem obvious problemsautomation remote server VDIThere’s always chance machine configured little differently , may affect results. glance results ? forgot project , wouldn’t able spot problems like can. S drive file tables don’t seem obvious problemspublic reports (dashboards)\ndeveloping report external audience (ie, people outside immediate research team), choose one two pals unfamiliar aims/methods impromptu focus group. Ask things need redesigned/reframed/reformated/-explained. (genevamarshall)\nplots\nplot labels/axes\nvariable names\nunits measurement (eg, proportion vs percentage y axis)\n\npublic reports (dashboards)developing report external audience (ie, people outside immediate research team), choose one two pals unfamiliar aims/methods impromptu focus group. Ask things need redesigned/reframed/reformated/-explained. (genevamarshall)\nplots\nplot labels/axes\nvariable names\nunits measurement (eg, proportion vs percentage y axis)\nplotsplot labels/axesvariable namesunits measurement (eg, proportion vs percentage y axis)documentation - bookdown\n\nBookdown worked well us far. 
’s basically independent markdown documents stored dedicated git repo. click “build” RStudio converts markdown files static html files. GitHub essentially serving backend, everyone can make changes sections don’t worried \n’s version ’s hosted publicly, tested can hosted shared file server. (’s possible html files static.) guys want OU’s collective CDW, please tell :\nwant able edit documents without review. ’ll add GitHub repo.\nwant able view documents. ’ll add dedicate file server space.\nhttps://ouhscbbmc.github.io/data-science-practices-1/workstation.html#installation-required\nthinking individual database gets chapter. BBMC ~4 databases sense: Centricity staging database, GECB staging database, central warehouse, (fledgling) downstream OMOP database. ~3 sections within chapter: () black--white description tables, columns, & indexes (written mostly consumers), (b) recommendations use table (written mostly consumers), (c) description ETL process (written mostly developers & admins).\nproposal uses GitHub Markdown ’re universal (knowledge R required –really write text editor & commit, let someone else click “build” RStudio machine). ’m flexible . ’ll support & contribute system guys feel work well across teams.\ndocumentation - bookdownBookdown worked well us far. ’s basically independent markdown documents stored dedicated git repo. click “build” RStudio converts markdown files static html files. GitHub essentially serving backend, everyone can make changes sections don’t worried aboutHere’s version ’s hosted publicly, tested can hosted shared file server. (’s possible html files static.) guys want OU’s collective CDW, please tell :want able edit documents without review. ’ll add GitHub repo.want able view documents. ’ll add dedicate file server space.https://ouhscbbmc.github.io/data-science-practices-1/workstation.html#installation-requiredI thinking individual database gets chapter. 
BBMC ~4 databases sense: Centricity staging database, GECB staging database, central warehouse, (fledgling) downstream OMOP database. ~3 sections within chapter: () black--white description tables, columns, & indexes (written mostly consumers), (b) recommendations use table (written mostly consumers), (c) description ETL process (written mostly developers & admins).proposal uses GitHub Markdown ’re universal (knowledge R required –really write text editor & commit, let someone else click “build” RStudio machine). ’m flexible . ’ll support & contribute system guys feel work well across teams.developing packages\nR packages Hadley Wickham\nhttp://mangothecat.github.io/goodpractice/\ndeveloping packagesR packages Hadley WickhamR packages Hadley Wickhamhttp://mangothecat.github.io/goodpractice/http://mangothecat.github.io/goodpractice/Cargo cult programming “style computer programming characterized ritual inclusion code program structures serve real purpose.” (Wikipedia)\nteam decide elements file prototype repo prototype best .Cargo cult programming “style computer programming characterized ritual inclusion code program structures serve real purpose.” (Wikipedia)team decide elements file prototype repo prototype best .","code":""},{"path":"scratch-pad.html","id":"practices","chapter":"E Scratch Pad of Loose Ideas","heading":"E.2 Practices","text":".exit() add = TRUE (Wickham (2019), Exit handlers).","code":""},{"path":"scratch-pad.html","id":"good-sites","chapter":"E Scratch Pad of Loose Ideas","heading":"E.3 Good Sites","text":"Posts sites almost always worth time reading. frequently improve develop common components used data pipelines.Yihui Xie, created knitr important contributions reproducible research.RStudio, addition IDE, many packages used created developers.Explain xkcd ’s good.Occasionally skim titles sites pick relevant interests. 
think helps keep aware developments field, skills continually grow approaches don’t become stagnant.O’Reilly’s Data science ideas resourcesTowards Data ScienceThese books haven’t referenced (yet), good guidance worth time skimming see relevant.Tidynomicon Dhavide Aruliah & Greg WilsonThe Tidynomicon Dhavide Aruliah & Greg WilsonEfficient R programming Colin Gillespie & Robin LovelaceEfficient R programming Colin Gillespie & Robin LovelaceMastering Software Development RMastering Software Development R","code":""},{"path":"example-dashboard.html","id":"example-dashboard","chapter":"F Example Dashboard","heading":"F Example Dashboard","text":"Communicating quantitative trends community quantitative phobia can difficult. appendix showcases dashboard style evolved past years OSDH Home Visiting, twelve local programs practitioners implemented intervention ideas tailored interests community.50 dashboards developed: custom dashboard developed program’s cycle, three additional dashboards communicate results program-agnostic investigations. style guide important tool managing many unique investigationsFor program-specific dashboard, ’s important meet needs individual PDSA conform guide. However, aim make dashboards consistent possible several reasons:’s less work practitioners. familiar presentation help practitioners grow comfortable new cycle’s dashboard. Recall use least five dashboards years.’s less work analysts/developers. Within cycle, consistent format (relatively interchangeable features) means one analyst can easily contribute trouble shoot colleague’s dashboard.lessons ’ve learned (mistakes ’ve made) can applied later dashboards. quality improve development quicken.Just like CQI grant encourages HV program learn history learn others, analysts . 
work programs design PDSA, one analyst learn strengths weaknesses current dashboard style, propose improvements.","code":""},{"path":"example-dashboard.html","id":"example-dashboard-example","chapter":"F Example Dashboard","heading":"F.1 Example","text":"example dashboard mimic real CQI available https://ouhscbbmc.github.io/data-science-practices-1/dashboard-1.html. dashboard source code available analysis/dashboard-1 directory R Analysis Skeleton repository’; repo contains code documents entire pipeline leading dashboard.’ve success developing distributing dashboards self-contained html files. portable don’t dependencies local data files remote databases, yet JavaScript CSS provide modest amount interactivity. dashboard’s principal components flexdashboard, plotly, ggplot2, R Markdown.dashboard synthetic data, cognitive measure tracked across 14 years three home visiting counties.","code":""},{"path":"example-dashboard.html","id":"example-dashboard-guide","chapter":"F Example Dashboard","heading":"F.2 Style Guide","text":"section describes set practices BBMC analysts decided best CQI dashboards used MIECHV evaluations. sense, CQI dashboard guide supplements overall style guide.MIECHV CQI dashboards based RStudio’s flexdashboard package, uses rmarkdown, JavaScript, CSS. flexdashboard great website read anyone adapting guide CQI projects.","code":""},{"path":"example-dashboard.html","id":"headline-page","chapter":"F Example Dashboard","heading":"F.2.1 Headline page","text":"\ndashboard’s greeting good blend () orientating user context (b) welcoming overwhelming. second PDSA cycle, try one two important impactful graphs first page; specialized graphs pages later.Left column: Text qualified {.tabset}\nNotes tab: text provides info dashboard’s dataset, \nCount () models, (b) programs, (c) clients, (d) observations\nDate range\nspecific program_codes. 
Even though PDSA focused specific program, ideally programs included feel others .\n\nNotes tab: text provides info dashboard’s dataset, \nCount () models, (b) programs, (c) clients, (d) observations\nDate range\nspecific program_codes. Even though PDSA focused specific program, ideally programs included feel others .\nCount () models, (b) programs, (c) clients, (d) observationsDate rangeThe specific program_codes. Even though PDSA focused specific program, ideally programs included feel others .Right column: Headline Graph(s) optionally qualified {.tabset}.\nIdeally starts overall graph, longitudinal component.\nShow data program, overall model.\nIdeally starts overall graph, longitudinal component.Show data program, overall model.","code":""},{"path":"example-dashboard.html","id":"tables-page","chapter":"F Example Dashboard","heading":"F.2.2 Tables page","text":"\ntables provide exactness, especially exactness () actual y value (b) frequency longitudinal values. tables make easier see ’re inadvertently plotting multiple values month, month missing. future, can add ‘Download CSV’ button anyone requests .Another advantage tables measures visible screen. typical program-month table least 6 columns: program_code, month, model, outcome measure, process measure, disruptor measure. difficult , upstream scribe probably isn’t job well. tables almost untouched rds files created ‘load-data’ chunk.tab represent different unit analysis (e.g., single row summarizing completed visits program-month). Use tabs appropriate PDSA. 
Go biggest unit (e.g., model) smallest unit (e.g., Provider-Week).Unnamed column qualified {.tabset}.\nModel tab\nProgram tab\nProgram-Month tab\nProgram-Week tab\nProvider-Week tab\nSpaghetti Annotation tab spaghetti plots use faint vertical lines mark events (e.g., start PDSA intervention), include events .\nUnnamed column qualified {.tabset}.Model tabProgram tabProgram-Month tabProgram-Week tabProvider-Week tabSpaghetti Annotation tab spaghetti plots use faint vertical lines mark events (e.g., start PDSA intervention), include events .","code":""},{"path":"example-dashboard.html","id":"graphs-page","chapter":"F Example Dashboard","heading":"F.2.3 Graphs page","text":"\ngraphs plots provide user feel trends. One graph focuses one measure, ideally max three spaghetti plots. Ideally change time (PDSA’s program) compared programs period. PSDA multiple Process Measures, give separate tabs labeled ‘Process Measure 1’ & ‘Process Measure 2’.Unnamed column qualified {.tabset}.\nOutcome Measure tab\nProcess Measure tab\nDisruptor Measure tab\nUnnamed column qualified {.tabset}.Outcome Measure tabProcess Measure tabDisruptor Measure tabIf spaghetti plot depicts proportion/percentage measure, include visual layer count/denominator behind proportion (instead separate spaghetti plot dedicated denominator). may include one following:geom_point presence/absence denotes nonzero/zero denominatorgeom_point size denotes denominator’s size.geom_text (place geom_point) explicitly shows denominator’s sizegeom_text along bottom axis explicitly shows denominator’s sizeuse spaghetti_2() located display-1.R. (yet developed.) 
Add hover text spaghetti.","code":""},{"path":"example-dashboard.html","id":"marginal-graphs-page","chapter":"F Example Dashboard","heading":"F.2.4 Marginal Graphs page","text":"\nmarginal histograms provide context.Single column, qualified {.tabset}.Single column, qualified {.tabset}.Contains marginal/univariate graph variables analysis.\nMarginal graph outcome measure\nMarginal graph process measure\nMarginal graph disruptor measure\nContains marginal/univariate graph variables analysis.Marginal graph outcome measureMarginal graph process measureMarginal graph disruptor measureShow data program, overall model.Show data program, overall model.Use histogram_2() located display-1.R (link accessible Oklahoma’s MIECHV evaluation team). Add hover text histogram.Use histogram_2() located display-1.R (link accessible Oklahoma’s MIECHV evaluation team). Add hover text histogram.datasets unit analysis (e.g., ‘program-month’), don’t use H3 tab. Use (H3) tabs marginals one level (e.g., visit date program-month, visit date program-week, visit date provider-week). avoid multiple levels, possible; especially program isn’t fluent single level.datasets unit analysis (e.g., ‘program-month’), don’t use H3 tab. Use (H3) tabs marginals one level (e.g., visit date program-month, visit date program-week, visit date provider-week). avoid multiple levels, possible; especially program isn’t fluent single level.histograms specific y-axis. example, “Count Months” instead “Frequency”histograms specific y-axis. example, “Count Months” instead “Frequency”","code":""},{"path":"example-dashboard.html","id":"documentation-page","chapter":"F Example Dashboard","heading":"F.2.5 Documentation page","text":"\ndocumentation self-contained html file, ’s easier practitioner quickly get explanation return trends.Sometimes ’s best place explanation/annotation right next relevant content, times ’s distracting. ’s always work maintain explanations ’re spread-across interface. 
let’s try keeping almost everything one two tabs Documentation page.help beyond , let’s try reuse many documentation tabs possible. first tab specific methodology displays PDSA. remaining tabs reference common Rmd files; content automatically update dashboard rendered next.Unnamed column qualified {.tabset}.\nExplanation –Current PDSA\nExplanation –CQI Dashboards\nGlossary\nTips\nConfig\nUnnamed column qualified {.tabset}.Explanation –Current PDSAExplanation –CQI DashboardsGlossaryTipsConfig","code":""},{"path":"example-dashboard.html","id":"miscellaneous-notes","chapter":"F Example Dashboard","heading":"F.2.6 Miscellaneous Notes","text":"hierarchy level outline indicates HTML-heading level. Numbers H1 (.e., ======) specify pages, roman numerals H2 (.e., ------) specify columns, letters H3 (.e., ###) specify tabs.hierarchy level outline indicates HTML-heading level. Numbers H1 (.e., ======) specify pages, roman numerals H2 (.e., ------) specify columns, letters H3 (.e., ###) specify tabs.Cosmetics connote type dashboard. Specify using theme css yaml keywords Rmd header.\nCommon measures: theme: simplex uses red banner.\n1st cycle PDSAs (.e., initial cycle MIECHV 3): theme: cosmo uses blue banner. default used theme specified.\n2nd cycle PDSAs: theme: flatly uses turquoise banner.\n3rd cycle PDSAs: theme: journal uses light red banner.\n4th cycle PDSAs (.e., initial cycle MIECHV 5): custom css purple banner (public copy css available). Instead theme, line (four leading spaces, yaml entry nested output flexdashboard::flex_dashboard)\n    css: ../../common/style-cqi-cycle-4.css\nCosmetics connote type dashboard. Specify using theme css yaml keywords Rmd header.Common measures: theme: simplex uses red banner.Common measures: theme: simplex uses red banner.1st cycle PDSAs (.e., initial cycle MIECHV 3): theme: cosmo uses blue banner. default used theme specified.1st cycle PDSAs (.e., initial cycle MIECHV 3): theme: cosmo uses blue banner. 
default used theme specified.2nd cycle PDSAs: theme: flatly uses turquoise banner.2nd cycle PDSAs: theme: flatly uses turquoise banner.3rd cycle PDSAs: theme: journal uses light red banner.3rd cycle PDSAs: theme: journal uses light red banner.4th cycle PDSAs (.e., initial cycle MIECHV 5): custom css purple banner (public copy css available). Instead theme, line (four leading spaces, yaml entry nested output flexdashboard::flex_dashboard)\n    css: ../../common/style-cqi-cycle-4.css4th cycle PDSAs (.e., initial cycle MIECHV 5): custom css purple banner (public copy css available). Instead theme, line (four leading spaces, yaml entry nested output flexdashboard::flex_dashboard)","code":"    css: ../../common/style-cqi-cycle-4.css"},{"path":"example-dashboard.html","id":"example-dashboard-architecture","chapter":"F Example Dashboard","heading":"F.3 Architecture","text":"dashboard one piece large workflow. design construction workflow discussed book, highlighted .\n.\n","code":""},{"path":"example-dashboard.html","id":"data-from-external-system","chapter":"F Example Dashboard","heading":"F.3.1 Data from External System","text":"","code":""},{"path":"example-dashboard.html","id":"groomed-data-in-warehouse","chapter":"F Example Dashboard","heading":"F.3.2 Groomed Data in Warehouse","text":"","code":""},{"path":"example-dashboard.html","id":"analysis-ready-dataset","chapter":"F Example Dashboard","heading":"F.3.3 Analysis-Ready Dataset","text":"little data manipulation occur dashboard. upstream scribe produce analysis-ready rds file. dashboard concerned presenting graphs, tables, summary text, documentation.little data manipulation occur dashboard. upstream scribe produce analysis-ready rds file. dashboard concerned presenting graphs, tables, summary text, documentation.Include common measure PDSA explicitly mentions . Try show measures ’re directly related PDSA. PDSA dashboard less exposure change (makes easier maintain). 
program needs context measures, can look common measure dashboard.Include common measure PDSA explicitly mentions . Try show measures ’re directly related PDSA. PDSA dashboard less exposure change (makes easier maintain). program needs context measures, can look common measure dashboard.","code":""},{"path":"example-chapter.html","id":"example-chapter","chapter":"G Example Chapter","heading":"G Example Chapter","text":"intro copied 1st chapter example bookdown repo. ’m keeping temporarily reference.can label chapter section titles using {#label} , e.g., can reference Intro Chapter. manually label , automatic labels anywayFigures tables captions placed figure table environments, respectively.\nFigure G.1: nice figure!\nReference figure code chunk label fig: prefix, e.g., see Figure G.1. Similarly, can reference tables generated knitr::kable(), e.g., see Table G.1.Table G.1: nice table!can write citations, . example, using bookdown package (Xie 2023) sample book, built top R Markdown knitr (Xie 2015).","code":"\npar(mar = c(4, 4, .1, .1))\nplot(pressure, type = 'b', pch = 19)\nknitr::kable(\n  head(iris, 20), caption = 'Here is a nice table!',\n  booktabs = TRUE\n)"},{"path":"acknowledgements.html","id":"acknowledgements","chapter":"H Acknowledgements","heading":"H Acknowledgements","text":"authors thank colleagues discussions experiences data science lead book. OUHSC, includes\n@adrose,\n@aggie-dbc,\n@ARPeters,\n@Ashley-Jorgensen,\n@athumann,\n@atreat1,\n@caston60,\n@chanukyalakamsani,\n@CWilliamsOUHSC,\n@DavidBard,\n@evoss1,\n@genevamarshall,\n@Maleeha,\n@man9472,\n@rmatkins,\n@sbohora,\n@thomasnwilson,\n@vimleshbavadiya,\n@waleboro,\n@YuiYamaoka,\n@yutiantang.Outside OUHSC, includes@andkov,\n@ben519,\n@cscherrer,\n@cmodzelewski,\n@jimquallen,\n@mhunter1,\n@probinso,\n@russelljonas, \n@spopovych.`r (knitr::is_html_output()) ’","code":""},{"path":"references.html","id":"references","chapter":"I References","heading":"I References","text":"","code":""}]
+[{"path":"index.html","id":"intro","chapter":"1 Introduction","heading":"1 Introduction","text":"collection documents describe practices used OUHSC BBMC analytics projects.","code":""},{"path":"coding.html","id":"coding","chapter":"2 Coding Principles","heading":"2 Coding Principles","text":"","code":""},{"path":"coding.html","id":"coding-simplify","chapter":"2 Coding Principles","heading":"2.1 Simplify","text":"","code":""},{"path":"coding.html","id":"coding-simplify-types","chapter":"2 Coding Principles","heading":"2.1.1 Data Types","text":"Use simplest data type reasonable. simpler data type less likely contain unintended values. seen, string variable called gender can simultaneously contain values “m”, “f”, “F”, “Female”, “MALE”, “0”, “1”, “2”, “Latino”, ““, NA. hand, boolean variable gender_male can FALSE, TRUE, NA.1SQLite dedicated datatype, must resort storing 0, 1 NULL values. caller can’t assume ostensible boolean SQLite variable contains three values, variable checked.cleaned variable initial ETL files (like Ellis), establish boundaries spend time downstream files verifying bad values introduced. small bonus, simpler data types typically faster, consume less memory, translate cleanly across platforms.Within R, numeric-ish variables can represented following four data types. Use simplest type adequately captures information. logical simplest numeric flexible.logical/boolean/bit,integer,bit64::integer64, andnumeric/double-precision floats.Categorical variables similar spectrum. logical types, factors restrictive less flexible characters.2logical/boolean/bit,factor, andcharacter.","code":""},{"path":"coding.html","id":"coding-simplify-categorical","chapter":"2 Coding Principles","heading":"2.1.2 Categorical Levels","text":"boolean variable restrictive factor character required, choose simplest representation. 
possible:Use lower case (e.g., ‘male’ instead ‘Male’ gender variable).avoid repeating variable level (e.g., ‘control’ instead ‘control condition’ condition variable).","code":""},{"path":"coding.html","id":"coding-simplify-recoding","chapter":"2 Coding Principles","heading":"2.1.3 Recoding","text":"Almost every project recodes variables. Choose simplest function possible. functions top easier read harder mess functions itLeverage existing booleans: Suppose logical variable gender_male (can TRUE, FALSE, NA). Writing gender_male == TRUE gender_male == FALSE evaluate boolean –’s unnecessary gender_male already boolean.\nTesting TRUE: use variable (.e., gender_male instead gender_male == TRUE).\nTesting FALSE: use !. Write !gender_male instead gender_male == FALSE gender_male != TRUE.\nLeverage existing booleans: Suppose logical variable gender_male (can TRUE, FALSE, NA). Writing gender_male == TRUE gender_male == FALSE evaluate boolean –’s unnecessary gender_male already boolean.Testing TRUE: use variable (.e., gender_male instead gender_male == TRUE).Testing TRUE: use variable (.e., gender_male instead gender_male == TRUE).Testing FALSE: use !. Write !gender_male instead gender_male == FALSE gender_male != TRUE.Testing FALSE: use !. 
Write !gender_male instead gender_male == FALSE gender_male != TRUE.dplyr::coalesce(): function evaluates single variable replaces NA values another variable.\ncoalesce like\n\nvisit_completed <- dplyr::coalesce(visit_completed, FALSE)\nmuch easier read mess \n\nvisit_completed <- dplyr::if_else(!.na(visit_completed), visit_completed, FALSE)dplyr::coalesce(): function evaluates single variable replaces NA values another variable.coalesce likeis much easier read mess thandplyr::na_if() transforms nonmissing value NA.\nRecoding missing values like\n\nbirth_apgar <- dplyr::na_if(birth_apgar, 99)\neasier read mess \n\nbirth_apgar <- dplyr::if_else(birth_apgar == 99, NA_real_, birth_apgar)dplyr::na_if() transforms nonmissing value NA.Recoding missing values likeis easier read mess <= (similar comparison operator): Compare two quantities output boolean variable. parentheses unnecessary, can help readability. either value NA, result NA.\nNotice prefer order variables like number line. result TRUE, smaller value left larger value.\n\ndob_in_the_future   <- (Sys.Date() < dob)\ndod_follows_dob     <- (dob <= dod)\npremature           <- (gestation_weeks < 37)\nbig_boy             <- (threshold_in_kg <= birth_weight_in_kg)<= (similar comparison operator): Compare two quantities output boolean variable. parentheses unnecessary, can help readability. either value NA, result NA.Notice prefer order variables like number line. result TRUE, smaller value left larger value.dplyr::if_else(): function evaluates single boolean variable expression. output branches three possibilities: input () true, (b) false, (c) (optionally) NA. 
Notice unlike <= operator, dplyr::if_else() lets specify value input expression evaluates NA.\n\ndate_start  <- .Date(\"2017-01-01\")\n\n# missing month element needs handled explicitly.\nstage       <- dplyr::if_else(date_start <= month, \"post\", \"pre\", missing = \"missing-month\")\n\n# Otherwise simple boolean output sufficient.\nstage_post  <- (date_start <= month)\nimportant reader understand input expression NA produce NA, consider using dplyr::if_else(). Even though two lines equivalent, casual reader may consider stage_post NA.\n\nstage_post  <- (date_start <= month)\nstage_post  <- dplyr::if_else(date_start <= month, TRUE, FALSE, missing = NA)dplyr::if_else(): function evaluates single boolean variable expression. output branches three possibilities: input () true, (b) false, (c) (optionally) NA. Notice unlike <= operator, dplyr::if_else() lets specify value input expression evaluates NA.important reader understand input expression NA produce NA, consider using dplyr::if_else(). Even though two lines equivalent, casual reader may consider stage_post NA.dplyr::(): function evaluates numeric x left right boundary return boolean value. output TRUE x inside boundaries equal either boundary (.e., boundaries inclusive). output FALSE x outside either boundary.\n\ntoo_cold      <- 60\ntoo_hot       <- 88\ngoldilocks_1  <- dplyr::(temperature, too_cold, too_hot)\n\n# equivalent previous line.\ngoldilocks_2  <- (too_cold <= temperature & temperature <= too_hot)\nneed exclusive boundary, abandon dplyr::() specify exactly.\n\n# Left boundary exclusive\ngoldilocks_3  <- (too_cold < temperature & temperature <= too_hot)\n\n# boundaries exclusive\ngoldilocks_4  <- (too_cold < temperature & temperature <  too_hot)\ncode starts nest dplyr::() calls inside dplyr::if_else(), consider base::cut().dplyr::(): function evaluates numeric x left right boundary return boolean value. output TRUE x inside boundaries equal either boundary (.e., boundaries inclusive). 
output FALSE x outside either boundary.need exclusive boundary, abandon dplyr::() specify exactly.code starts nest dplyr::() calls inside dplyr::if_else(), consider base::cut().base::cut(): function transforms single numeric variable factor. range cut different segments/categories one-dimensional number line. output branches single discrete value (either factor-level integer). Modify right parameter FALSE ’d like left/lower bound inclusive (tends natural ).\n\nmtcars |>\n  tibble::as_tibble() |>\n  dplyr::select(\n    disp,\n  ) |>\n  dplyr::mutate(\n    # Example simple inequality operator (see two bullets )\n    muscle_car            = (300 <= disp),\n\n    # Divide `disp` three levels.\n    size_default_labels   = cut(disp, breaks = c(-Inf, 200, 300, Inf), right = F),\n\n    # Divide `disp` three levels custom labels.\n    size_cut3             = cut(\n      disp,\n      breaks = c(-Inf,   200,      300,   Inf),\n      labels = c(  \"small\", \"medium\", \"big\"),\n      right = FALSE  # right boundary INclusive ('FALSE' EXclusive boundary)\n    ),\n\n    # Divide `disp` five levels custom labels.\n    size_cut5             = cut(\n      disp,\n      breaks = c(-Inf,         100,            150,            200,      300,   Inf),\n      labels = c(  \"small small\", \"medium small\", \"biggie small\", \"medium\", \"big\"),\n      right = FALSE\n    ),\n  )base::cut(): function transforms single numeric variable factor. range cut different segments/categories one-dimensional number line. output branches single discrete value (either factor-level integer). Modify right parameter FALSE ’d like left/lower bound inclusive (tends natural ).dplyr::recode(): function accepts integer character variable. output branches single discrete value. 
example maps integers strings.\n\n# https://www.census.gov/quickfacts/fact/note/US/RHI625219\nrace_id        <- c(1L, 2L, 1L, 4L, 3L, 4L, 2L, NA_integer_)\nrace_id_spouse <- c(1L, 1L, 2L, 3L, 3L, 4L, 5L, NA_integer_)\nrace <-\n  dplyr::recode(\n    race_id,\n    \"1\"      = \"White\",\n    \"2\"      = \"Black African American\",\n    \"3\"      = \"American Indian Alaska Native\",\n    \"4\"      = \"Asian\",\n    \"5\"      = \"Native Hawaiian Pacific Islander\",\n    .missing = \"Unknown\"\n  )\nmultiple variables mapping, define mapping named vector, pass multiple calls dplyr::recode(). Notice two variables race race_spouse use mapping.3\n\nmapping_race <- c(\n  \"1\" = \"White\",\n  \"2\" = \"Black African American\",\n  \"3\" = \"American Indian Alaska Native\",\n  \"4\" = \"Asian\",\n  \"5\" = \"Native Hawaiian Pacific Islander\"\n)\nrace <-\n  dplyr::recode(\n    race_id,\n    !!!mapping_race,\n    .missing = \"Unknown\"\n  )\nrace_spouse <-\n  dplyr::recode(\n    race_id_spouse,\n    !!!mapping_race,\n    .missing = \"Unknown\"\n  )\nTips dplyr::recode():\nreusable dedicated mapping vector useful surveys 10+ Likert items consistent levels like “disagree”, “neutral”, “agree”.\nUse dplyr::recode_factor() map integers factor levels.\nforcats::fct_recode() similar. prefer .missing parameter dplyr::recode() translates NA explicit value.\nusing REDCap API, functions help convert radio buttons character factor variable.\ndplyr::recode(): function accepts integer character variable. output branches single discrete value. example maps integers strings.multiple variables mapping, define mapping named vector, pass multiple calls dplyr::recode(). Notice two variables race race_spouse use mapping.3Tips dplyr::recode():reusable dedicated mapping vector useful surveys 10+ Likert items consistent levels like “disagree”, “neutral”, “agree”.Use dplyr::recode_factor() map integers factor levels.forcats::fct_recode() similar. 
prefer .missing parameter dplyr::recode() translates NA explicit value.using REDCap API, functions help convert radio buttons character factor variable.lookup table: feasible recode 6 levels race directly R, ’s less feasible recode 200 provider names. Specify mapping csv, use readr convert csv data.frame, finally left join .lookup table: feasible recode 6 levels race directly R, ’s less feasible recode 200 provider names. Specify mapping csv, use readr convert csv data.frame, finally left join .dplyr::case_when(): function complicated can evaluate multiple input variables. Also, multiple cases can true, first output returned. ‘water fall’ execution helps complicated scenarios, overkill .dplyr::case_when(): function complicated can evaluate multiple input variables. Also, multiple cases can true, first output returned. ‘water fall’ execution helps complicated scenarios, overkill .","code":"\nvisit_completed <- dplyr::coalesce(visit_completed, FALSE)\nvisit_completed <- dplyr::if_else(!is.na(visit_completed), visit_completed, FALSE)\nbirth_apgar <- dplyr::na_if(birth_apgar, 99)\nbirth_apgar <- dplyr::if_else(birth_apgar == 99, NA_real_, birth_apgar)\ndob_in_the_future   <- (Sys.Date() < dob)\ndod_follows_dob     <- (dob <= dod)\npremature           <- (gestation_weeks < 37)\nbig_boy             <- (threshold_in_kg <= birth_weight_in_kg)\ndate_start  <- as.Date(\"2017-01-01\")\n\n# If a missing month element needs to be handled explicitly.\nstage       <- dplyr::if_else(date_start <= month, \"post\", \"pre\", missing = \"missing-month\")\n\n# Otherwise a simple boolean output is sufficient.\nstage_post  <- (date_start <= month)\nstage_post  <- (date_start <= month)\nstage_post  <- dplyr::if_else(date_start <= month, TRUE, FALSE, missing = NA)\ntoo_cold      <- 60\ntoo_hot       <- 88\ngoldilocks_1  <- dplyr::between(temperature, too_cold, too_hot)\n\n# This is equivalent to the previous line.\ngoldilocks_2  <- (too_cold <= temperature & temperature <= too_hot)\n# Left 
boundary is exclusive\ngoldilocks_3  <- (too_cold < temperature & temperature <= too_hot)\n\n# Both boundaries are exclusive\ngoldilocks_4  <- (too_cold < temperature & temperature <  too_hot)\nmtcars |>\n  tibble::as_tibble() |>\n  dplyr::select(\n    disp,\n  ) |>\n  dplyr::mutate(\n    # Example of a simple inequality operator (see two bullets above)\n    muscle_car            = (300 <= disp),\n\n    # Divide `disp` into three levels.\n    size_default_labels   = cut(disp, breaks = c(-Inf, 200, 300, Inf), right = F),\n\n    # Divide `disp` into three levels with custom labels.\n    size_cut3             = cut(\n      disp,\n      breaks = c(-Inf,   200,      300,   Inf),\n      labels = c(  \"small\", \"medium\", \"big\"),\n      right = FALSE  # Is the right boundary INclusive ('FALSE' is an EXclusive boundary)\n    ),\n\n    # Divide `disp` into five levels with custom labels.\n    size_cut5             = cut(\n      disp,\n      breaks = c(-Inf,         100,            150,            200,      300,   Inf),\n      labels = c(  \"small small\", \"medium small\", \"biggie small\", \"medium\", \"big\"),\n      right = FALSE\n    ),\n  )\n# https://www.census.gov/quickfacts/fact/note/US/RHI625219\nrace_id        <- c(1L, 2L, 1L, 4L, 3L, 4L, 2L, NA_integer_)\nrace_id_spouse <- c(1L, 1L, 2L, 3L, 3L, 4L, 5L, NA_integer_)\nrace <-\n  dplyr::recode(\n    race_id,\n    \"1\"      = \"White\",\n    \"2\"      = \"Black or African American\",\n    \"3\"      = \"American Indian and Alaska Native\",\n    \"4\"      = \"Asian\",\n    \"5\"      = \"Native Hawaiian or Other Pacific Islander\",\n    .missing = \"Unknown\"\n  )\nmapping_race <- c(\n  \"1\" = \"White\",\n  \"2\" = \"Black or African American\",\n  \"3\" = \"American Indian and Alaska Native\",\n  \"4\" = \"Asian\",\n  \"5\" = \"Native Hawaiian or Other Pacific Islander\"\n)\nrace <-\n  dplyr::recode(\n    race_id,\n    !!!mapping_race,\n    .missing = \"Unknown\"\n  )\nrace_spouse <-\n  dplyr::recode(\n    
race_id_spouse,\n    !!!mapping_race,\n    .missing = \"Unknown\"\n  )"},{"path":"coding.html","id":"coding-defensive","chapter":"2 Coding Principles","heading":"2.2 Defensive Style","text":"","code":""},{"path":"coding.html","id":"coding-defensive-qualify-functions","chapter":"2 Coding Principles","heading":"2.2.1 Qualify functions","text":"Try prepend function package. Write dplyr::filter() instead filter(). two packages contain public functions name, package recently called library() takes precedence. multiple R files executed, packages’ precedence may predictable. Specifying package eliminates ambiguity, also making code easier follow. reason, recommend almost R files contain ‘load-packages’ chunk. See Google Style Guide qualifying functions. exceptions exist, including: sf package ’re using objects dplyr verbs.","code":""},{"path":"coding.html","id":"coding-defensive-date-arithmetic","chapter":"2 Coding Principles","heading":"2.2.2 Date Arithmetic","text":"Don’t use minus operator (i.e., -) subtract dates. Instead use as.integer(difftime(stop, start, units=\"days\")). ’s longer protects scenario start stop changed upstream date datetime. case, stop - start returns number seconds two points, number days.","code":""},{"path":"coding.html","id":"excluding-bad-cases","chapter":"2 Coding Principles","heading":"2.2.3 Excluding Bad Cases","text":"variables critical record, ’s missing, don’t want trust values. instance, hospital visit record rarely useful null patient ID. cases, prevent record passing ellis. example, ’ll presume trust patient record lacks clean date birth (dob). Define permissible range, either ellis’s declare-globals chunk, config-file. (’ll use config file example.) ’ll exclude anyone born 2000, tomorrow. 
Even though ’s illogical someone retrospective record born tomorrow, consider bending little small errors.\nrange_dob   : !expr c(as.Date(\"2000-01-01\"), Sys.Date() + lubridate::days(1))\ntweak-data chunk, use OuhscMunge::trim_date() set cell NA falls outside acceptable range. dplyr::mutate(), call tidyr::drop_na() exclude entire record, regardless (a) already NA, (b) “trimmed” NA.\n\nds <-\n  ds |>\n  dplyr::mutate(\n    dob = OuhscMunge::trim_date(dob, config$range_dob)\n  ) |>\n  tidyr::drop_na(dob)\nEven though ’s overkill trimming, (eventually) verify variable three reasons: (a) ’s chance code isn’t working expected, (b) later code might introduced bad values, (c) clearly documents reader dob included range stage pipeline.\n\ncheckmate::assert_date(ds$dob, any.missing=F, lower=config$range_dob[1], upper=config$range_dob[2])","code":"range_dob   : !expr c(as.Date(\"2000-01-01\"), Sys.Date() + lubridate::days(1))\nds <-\n  ds |>\n  dplyr::mutate(\n    dob = OuhscMunge::trim_date(dob, config$range_dob)\n  ) |>\n  tidyr::drop_na(dob)\ncheckmate::assert_date(ds$dob, any.missing=F, lower=config$range_dob[1], upper=config$range_dob[2])"},{"path":"coding.html","id":"throw-errors-for-bad-cells","chapter":"2 Coding Principles","heading":"2.2.4 Throw errors for bad cells","text":"checkmate::assert_*() functions throw 
error stop R’s execution encountering vector violates constraints specified. previous snippet alert if ds$dob date, ds$dob least one NA value, or ds$dob value earlier config$range_dob[1] later config$range_dob[2]. package family functions accommodate many types vectors. common conditions verify: vector’s values unique, arises ’re upload primary key database (e.g., patient ID patient table),\n\ncheckmate::assert_integer(ds$pt_id, unique = TRUE)\nvector’s string follow strict pattern (e.g., patient ID “A” “B”, followed 4 digits)\n\ncheckmate::assert_character(ds$pt_id, pattern = \"^[AB]\\\\d{4}$\")\ndatabase doesn’t accept names longer 50 characters\n\ncheckmate::assert_character(ds$name_first, max.chars = 50)\n# or\ncheckmate::assert_character(ds$name_first, pattern = \"^.{0,50}$\")\nThe pattern argument ultimately passed base::grepl(), leverage regular expressions.","code":"\ncheckmate::assert_integer(ds$pt_id, unique = TRUE)\ncheckmate::assert_character(ds$pt_id, pattern = \"^[AB]\\\\d{4}$\")\ncheckmate::assert_character(ds$name_first, max.chars = 50)\n# or\ncheckmate::assert_character(ds$name_first, pattern = \"^.{0,50}$\")"},{"path":"coding.html","id":"throw-errors-for-bad-conditions","chapter":"2 Coding Principles","heading":"2.2.5 Throw errors for bad conditions","text":"Sometimes dataset smells fishy even though single cell violates constraint. Send flare ’s kinda bad, yet stop execution really stinks. especially important recurring scripts process new datasets never inspected human, daily forecast. Even though today’s incoming dataset fine, shouldn’t trust next month’s. worst, lonely test never catches violation (wasted 5 minutes). 
best, catches problem proceeded undetected compromised downstream analyses.following snippet asserts ’s acceptable 2% patients missing age, never get worse 5%. Therefore throws error missingness exceeds 5% throws warning exceeds 2%.","code":"\n# Simulate a vector of ages.\nds <- tibble::tibble(\n  age = sample(c(NA, 1:19), size = 100, replace = TRUE)\n)\n\n# Define thresholds for errors & warnings.\nthreshold_error     <- .05\nthreshold_warning   <- .02\n\n# Calculate proportion of missing cells.\nmissing_proportion  <- mean(is.na(ds$age))\n\n# Accompany the error/warning with an informative message.\nif (threshold_error < missing_proportion) {\n  stop(\n    \"The proportion of missing `age` values is \", missing_proportion,\n    \", but it shouldn't exceed \", threshold_error, \".\"\n  )\n} else if (threshold_warning < missing_proportion) {\n  warning(\n    \"The proportion of missing `age` values is \", missing_proportion,\n    \", but ideally it stays below \", threshold_warning, \".\"\n  )\n}"},{"path":"architecture.html","id":"architecture","chapter":"3 Architecture Principles","heading":"3 Architecture Principles","text":"","code":""},{"path":"architecture.html","id":"encapsulation","chapter":"3 Architecture Principles","heading":"3.1 Encapsulation","text":"","code":""},{"path":"architecture.html","id":"leverage-team-members-strengths-avoid-weaknesses","chapter":"3 Architecture Principles","heading":"3.2 Leverage team member’s strengths & avoid weaknesses","text":"","code":""},{"path":"architecture.html","id":"focused-code-files","chapter":"3 Architecture Principles","heading":"3.2.1 Focused code files","text":"","code":""},{"path":"architecture.html","id":"metadata-for-content-experts","chapter":"3 Architecture Principles","heading":"3.2.2 Metadata for content experts","text":"","code":""},{"path":"architecture.html","id":"scales","chapter":"3 Architecture Principles","heading":"3.3 
Scales","text":"","code":""},{"path":"architecture.html","id":"single-source-single-analysis","chapter":"3 Architecture Principles","heading":"3.3.1 Single source & single analysis","text":"","code":""},{"path":"architecture.html","id":"multiple-sources-multiple-analyses","chapter":"3 Architecture Principles","heading":"3.3.2 Multiple sources & multiple analyses","text":"","code":""},{"path":"architecture.html","id":"architecture-consistency","chapter":"3 Architecture Principles","heading":"3.4 Consistency","text":"","code":""},{"path":"architecture.html","id":"consistency-files","chapter":"3 Architecture Principles","heading":"3.4.1 Across Files","text":"","code":""},{"path":"architecture.html","id":"across-languages","chapter":"3 Architecture Principles","heading":"3.4.2 Across Languages","text":"","code":""},{"path":"architecture.html","id":"across-projects","chapter":"3 Architecture Principles","heading":"3.4.3 Across Projects","text":"","code":""},{"path":"prototype-r.html","id":"prototype-r","chapter":"4 Prototypical R File","heading":"4 Prototypical R File","text":"stated Consistency across Files, using consistent file structure can () improve quality code structure proven time facilitate good practices (b) allow intentions clear teammates familiar order intentions chunks.use term “chunk” section code corresponds knitr terminology (Xie 2015), many analysis files (opposed manipulation files), chunk R file connects knitr Rmd file.","code":""},{"path":"prototype-r.html","id":"chunk-clear","chapter":"4 Prototypical R File","heading":"4.1 Clear Memory","text":"initial chunk many files clear memory variables previous run. important developing debugging prevents previous runs contaminating subsequent runs. However little effect production; ’ll look manipulation files separately analysis files.Manipulation R files sourced argument local=new.env(). file executed fresh environment, variables clear. 
Analysis R files typically called Rmd file’s knitr::read_chunk(), code positioned first chunk called knitr 4.However typically clear memory R files sourced environment caller, interfere caller’s variables.","code":"\nrm(list = ls(all.names = TRUE))"},{"path":"prototype-r.html","id":"chunk-load-sources","chapter":"4 Prototypical R File","heading":"4.2 Load Sources","text":"first true chunk, source R files containing global variables functions current file requires. instance, team statisticians producing large report containing many analysis files, define many graphical elements single file. sourced file defines common color palettes graphical functions cosmetics uniform across analyses.prefer sourced files perform real action, importing data manipulating file. One reason difficult consistent environmental variables sourced file’s functions run. second reason cognitively difficult understand files connected.sourced file contains function definitions, operations can called time current file much tighter control variables modified. bonus discipline defining functions (instead executing functions) operations typically robust generalizable.Keep chunk even files sourced. empty chunk instructive readers trying determine files sourced. applies recommendation applies chunks discussed chapter. always, team agree set standards.","code":"\n# ---- load-sources ------------------------------------------------------------\nbase::source(file=\"./analysis/common/display-1.R\")      # Load common graphing functions."},{"path":"prototype-r.html","id":"chunk-load-packages","chapter":"4 Prototypical R File","heading":"4.3 Load Packages","text":"‘load-packages’ chunk declares required packages near file’s beginning three reasons. First, reader scanning file can quickly determine dependencies located single chunk. Second, machine lacking required package, best know early5. 
Third, style mimics requirement languages (declaring headers top C++ file) follows tidyverse style guide. discussed previous qualify functions section, recommend functions qualified package (e.g., foo::bar() instead merely bar()). Consequently, ‘load-packages’ chunk calls requireNamespace() frequently library(). requireNamespace() verifies package available local machine, load memory; library() verifies package available, loads . requireNamespace() used several scenarios. Core packages (e.g., ‘base’ ‘stats’) loaded R default installations. avoid unnecessary calls like library(stats) distract important features. Obvious dependencies called requireNamespace() library() similar reasons, especially called directly. example ‘tidyselect’ listed ‘tidyr’ listed. using version older R 4.1: “pipe” function (declared ‘magrittr’ package , i.e., %>%) attached import::from(magrittr, \"%>%\"). frequently-used function can called throughout execution without qualification. Compared manipulation files, analysis files tend use many functions concentrated packages conflicting function names less common. Typical packages used analysis ‘ggplot2’ ‘lme4’. sourced files may load packages (calling library()). important library() calls file follow ‘load-sources’ chunk identically-named functions (different packages) called correct precedence. Otherwise identically-named functions conflict namespace hard-to-predict results. Read R Packages library(), requireNamespace(), siblings, well larger concepts attaching functions search path. packages found manipulation files. Notice lesser-known packages quick explanation; helps maintainers decide declaration still necessary. 
Also notice packages distributed outside CRAN (e.g., GitHub) quick commented line help user install update package.","code":"\n# ---- load-packages -----------------------------------------------------------\n# import::from(magrittr, \"%>%\" )\n\nrequireNamespace(\"readr\"     )\nrequireNamespace(\"tidyr\"     )\nrequireNamespace(\"dplyr\"     )\nrequireNamespace(\"config\"    )\nrequireNamespace(\"checkmate\" ) # Asserts expected conditions\nrequireNamespace(\"OuhscMunge\") # remotes::install_github(repo=\"OuhscBbmc/OuhscMunge\")"},{"path":"prototype-r.html","id":"chunk-declare","chapter":"4 Prototypical R File","heading":"4.4 Declare Globals","text":"values repeatedly used within file, consider dedicating variable ’s defined set . also good place variables used , whose value central file’s mission. Typical variables ‘declare-globals’ chunk include data file paths, data file variables, color palettes, values config file.config file can coordinate static variable across multiple files. Centrally","code":"\n# ---- declare-globals ---------------------------------------------------------\n# Constant values that won't change.\nconfig                         <- config::get()\npath_db                        <- config$path_database\n\n# Execute to specify the column types.  It might require some manual adjustment (eg doubles to integers).\n#   OuhscMunge::readr_spec_aligned(config$path_subject_1_raw)\ncol_types <- readr::cols_only(\n  subject_id          = readr::col_integer(),\n  county_id           = readr::col_integer(),\n  gender_id           = readr::col_double(),\n  race                = readr::col_character(),\n  ethnicity           = readr::col_character()\n)"},{"path":"prototype-r.html","id":"chunk-load-data","chapter":"4 Prototypical R File","heading":"4.5 Load Data","text":"data ingested file occurs chunk. like think file linear pipe single point input single point output. 
Although possible file read data files line, recommend avoiding sprawl difficult humans understand. software developer deist watchmaker, file’s fate sealed end chunk. makes easier human reason isolate problems either existing (a) incoming data (b) calculations data. Ideally chunk consumes data either plain-text csv database. Many capable R functions packages ingest data. prefer tidyverse readr reading conventional files; younger cousin, vroom nice advantages working larger files forms jagged rectangles. Depending file format, good packages consider data.table, haven, readxl, openxlsx, arrow, jsonlite, fst, yaml, rio. used Ellis, chunk likely consumes flat file like csv data metadata. used Ferry, Arch, Scribe, chunk likely consumes database table. used Analysis file, chunk likely consumes database table rds (i.e., compressed R data file). large-scale scenarios, may series datasets held RAM simultaneously. first choice split R file new file subset datasets –words, R file probably given much responsibility. Occasionally multiple datasets need considered , splitting R file option. scenarios, prefer upload datasets database, better manipulating datasets large RAM. R solution may loosen restriction dataset enter R file ‘load-data’ chunk. dataset processed longer needed, rm() removes RAM. Now another dataset can read file manipulated. loose scrap:\nchunk reads data (e.g., database table, networked CSV, local lookup table). chunk, new data introduced. sake reducing human cognition load. 
Everything chunk derived first four chunks.","code":""},{"path":"prototype-r.html","id":"chunk-tweak-data","chapter":"4 Prototypical R File","heading":"4.6 Tweak Data","text":"loose scrap:\n’s best rename dataset () single place (b) early pipeline, bad variable never referenced.","code":"\n# OuhscMunge::column_rename_headstart(ds) # Help write `dplyr::select()` call.\nds <-\n  ds |>\n  dplyr::select(    # `dplyr::select()` drops columns not included.\n    subject_id,\n    county_id,\n    gender_id,\n    race,\n    ethnicity\n  ) |>\n  dplyr::mutate(\n\n  ) |>\n  dplyr::arrange(subject_id) # |>\n  # tibble::rowid_to_column(\"subject_id\") # Add a unique index if necessary"},{"path":"prototype-r.html","id":"chunk-unique","chapter":"4 Prototypical R File","heading":"4.7 (Unique Content)","text":"section represents chunks tweak-data verify-values. chunks contain file’s creativity contribution. sense, structure first last chunks allow middle chunks focus concepts instead plumbing.simple files like ellis metadata file, may even need anything . complex analysis files may 200+ lines distributed across dozen chunks. recommend create dedicate chunk conceptual stage. 
one starts contain ~20 lines, consider granular organization clarify code’s intent.","code":""},{"path":"prototype-r.html","id":"chunk-verify-values","chapter":"4 Prototypical R File","heading":"4.8 Verify Values","text":"Running OuhscMunge::verify_value_headstart(ds) ","code":"\n# ---- verify-values -----------------------------------------------------------\n# Sniff out problems\n# OuhscMunge::verify_value_headstart(ds)\ncheckmate::assert_integer(  ds$county_month_id    , any.missing=F , lower=1, upper=3080                , unique=T)\ncheckmate::assert_integer(  ds$county_id          , any.missing=F , lower=1, upper=77                            )\ncheckmate::assert_date(     ds$month              , any.missing=F , lower=as.Date(\"2012-06-15\"), upper=Sys.Date())\ncheckmate::assert_character(ds$county_name        , any.missing=F , pattern=\"^.{3,12}$\"                          )\ncheckmate::assert_integer(  ds$region_id          , any.missing=F , lower=1, upper=20                            )\ncheckmate::assert_numeric(  ds$fte                , any.missing=F , lower=0, upper=40                            )\ncheckmate::assert_logical(  ds$fte_approximated   , any.missing=F                                                )\ncheckmate::assert_numeric(  ds$fte_rolling_median , any.missing=T , lower=0, upper=40                            )\n\ncounty_month_combo   <- paste(ds$county_id, ds$month)\ncheckmate::assert_character(county_month_combo, pattern  =\"^\\\\d{1,2} \\\\d{4}-\\\\d{2}-\\\\d{2}$\", any.missing=F, unique=T)"},{"path":"prototype-r.html","id":"chunk-specify-columns","chapter":"4 Prototypical R File","heading":"4.9 Specify Output Columns","text":"chunk:verifies variables exist uploading,documents (troubleshooting developers) variables product file, andreorders variables match expected structure.Variable order especially important database engines/drivers ignore variable name, use variable position.use term ‘slim’ typically output fewer variables full 
dataset processed file. doubt variable needed downstream, leave dplyr::select(), commented . someone needs future, ’ll easily determine might come , uncomment line (possibly modify database table). import column warehouse multiple people using, can tough remove without breaking code. chunk follows verify-values sometimes want check validity variables consumed downstream. variables important , illegal value may reveal larger problem dataset.","code":"\n# Print colnames that `dplyr::select()`  should contain below:\n#   cat(paste0(\"    \", colnames(ds), collapse=\",\\n\"))\n\n# Define the subset of columns that will be needed in the analyses.\n#   The fewer columns that are exported, the fewer things that can break downstream.\n\nds_slim <-\n  ds |>\n  # dplyr::slice(1:100) |>\n  dplyr::select(\n    subject_id,\n    county_id,\n    gender_id,\n    race,\n    ethnicity\n  )\n\nds_slim"},{"path":"prototype-r.html","id":"save-to-disk-or-database","chapter":"4 Prototypical R File","heading":"4.10 Save to Disk or Database","text":"","code":""},{"path":"prototype-r.html","id":"additional-resources","chapter":"4 Prototypical R File","heading":"4.11 Additional Resources","text":"(Colin Gillespie 2017), particularly “Efficient input/output” chapter.","code":""},{"path":"prototype-sql.html","id":"prototype-sql","chapter":"5 Prototypical SQL File","heading":"5 Prototypical SQL File","text":"New data scientists typically import entire tables database R, merge, filter, groom data.frames. efficient approach submit sql executes database returns specialized dataset. provides several advantages:\ndatabase much efficient filtering joining tables programming language, R Python. well-designed database indexed columns optimizations surpass R Python capabilities.\ndatabase handles datasets thousands times larger R Python can accommodate RAM. 
large datasets, database engines persist data hard drive (instead just RAM) optimized read necessary information RAM moment needed, return processed back disk progressing next block data.\nFrequently, portion table’s rows columns ultimately needed analysis. Reducing size dataset leaving database two benefits: less information travels across network R’s Python’s limited memory space conserved. scenarios, desirable use INSERT SQL command transfer data within database; never travel across network never touch R local machine. large complicated projects, majority data movement uses INSERT commands within SQL files. Among scenarios, analysis-focused projects use R call sequence SQL files (see flow.R), database-focused project uses SSIS. cases, try write SQL files conform similar standards conventions. stated Consistency across Files (previous chapter), using consistent file structure can (a) improve quality code structure proven time facilitate good practices (b) allow intentions clear teammates familiar order intentions chunks.","code":""},{"path":"prototype-sql.html","id":"sql-choice","chapter":"5 Prototypical SQL File","heading":"5.1 Choice of Database Engine","text":"major relational database engines use roughly syntax, slight deviations enhancements beyond SQL standards. databases hosted SQL Server, since OUHSC’s campus seems comfortable supporting. Consequently, chapter uses SQL Server 2017+ syntax. like data science teams, still need consume databases, Oracle MySQL. 
Outside OUHSC projects, tend use PostgreSQL Redshift.","code":""},{"path":"prototype-sql.html","id":"sql-ferry","chapter":"5 Prototypical SQL File","heading":"5.2 Ferry","text":"basic sql file moves data within database create table named dx, contained ley_covid_1 schema cdw_staging database.","code":"--use cdw_staging\ndeclare @start_date date = '2020-02-01';                               -- sync with config.yml\ndeclare @stop_date  date = dateadd(day, -1, cast(getdate() as date));  -- sync with config.yml\n\nDROP TABLE if exists ley_covid_1.dx;\nCREATE TABLE ley_covid_1.dx(\n  dx_id           int identity  primary key,\n  patient_id      int           not null,\n  covid_confirmed bit           not null,\n  problem_date    date,\n  icd10_code      varchar(20)   not null\n);\n-- TRUNCATE TABLE ley_covid_1.dx;\n\nINSERT INTO ley_covid_1.dx\nSELECT\n  pr.patient_id\n  ,ss.covid_confirmed\n  ,pr.invoice_date     as problem_date\n  ,pr.code             as icd10_code\n  -- into ley_covid_1.dx\nFROM cdw.star_1.fact_problem       as pr\n  inner join beasley_covid_1.ss_dx as ss on pr.code = ss.icd10_code\nWHERE\n  pr.problem_date_start between @start_date and @stop_date\n  and\n  pr.patient_id is not null\nORDER BY pr.patient_id, pr.problem_date_start desc\n\nCREATE INDEX ley_covid_1_dx_patient_id on ley_covid_1.dx (patient_id);\nCREATE INDEX ley_covid_1_dx_icd10_code on ley_covid_1.dx (icd10_code);"},{"path":"prototype-sql.html","id":"sql-default-database","chapter":"5 Prototypical SQL File","heading":"5.3 Default Databases","text":"prefer specify database table, instead control connection (DSN’s “default database” value). Nevertheless, ’s helpful include default database behind comment two reasons. First, communicates default database human reader. 
Second, debugging, code can highlighted ADS/SSMS executed “F5”; mimic happens file run via automation DSN.","code":"--use cdw_staging"},{"path":"prototype-sql.html","id":"sql-declare","chapter":"5 Prototypical SQL File","heading":"5.4 Declare Values Databases","text":"Similar Declare Globals chunk prototypical R file, values set top file easy read modify.","code":"declare @start_date date = '2020-02-01';                               -- sync with config.yml\ndeclare @stop_date  date = dateadd(day, -1, cast(getdate() as date));  -- sync with config.yml"},{"path":"prototype-sql.html","id":"sql-recreate","chapter":"5 Prototypical SQL File","heading":"5.5 Recreate Table","text":"batch-loading data, typically easiest drop recreate database table. snippet , table specific name dropped/deleted database replaced (possibly new) definition. like dedicate line table column, least three elements per line: name, data type, nulls allowed.Many features keywords available designing tables. ones occasionally use :primary key helps database optimization later querying table, enforces uniqueness, patient table two rows patient_id value. Primary keys must nonmissing, null keyword redundant.unique helpful table additional columns need unique (patient_ssn patient_id). advanced scenario using clustered columnar table, incompatible primary key designation.identity(1, 1) creates 1, 2, 3, … sequence, relieves client creating sequence something like row_number(). Note identity column exists, number columns SELECT clause one fewer columns defined CREATE TABLE.jump-start creation table definition, frequently use clause. operation creates new table, informed column properties source tables. Within ADS SSMS, refresh list tables select new table; option copy CREATE TABLE statement (similar snippet ) paste sql file. 
definition can modified, tightening null null.","code":"DROP TABLE if exists ley_covid_1.dx;\nCREATE TABLE ley_covid_1.dx(\n  dx_id           int identity(1, 1) primary key,\n  patient_id      int         not null,\n  covid_confirmed bit         not null,\n  problem_date    date            null,\n  icd10_code      varchar(20) not null\n);  -- into ley_covid_1.dx"},{"path":"prototype-sql.html","id":"sql-truncate","chapter":"5 Prototypical SQL File","heading":"5.6 Truncate Table","text":"scenarios table definition stable data refreshed frequently (say, daily), consider TRUNCATE-ing table. taking approach, prefer keep DROP CREATE code file, commented . saves development time future table definition needs modified.","code":"-- TRUNCATE TABLE ley_covid_1.dx;"},{"path":"prototype-sql.html","id":"sql-insert","chapter":"5 Prototypical SQL File","heading":"5.7 INSERT INTO","text":"INSERT INTO (followed SELECT clause), simply moves data query specified table. INSERT clause transfers columns exact order query. try match names destination table. error thrown column types mismatched (e.g., attempting insert character string integer value). Even worse, error thrown mismatched columns compatible types. occur table’s columns patient_id, weight_kg, height_cm, query’s columns patient_id, height_cm, weight_in. weight height written incorrect columns, execution catch source weight_in, destination weight_kg.","code":"INSERT INTO ley_covid_1.dx"},{"path":"prototype-sql.html","id":"sql-select","chapter":"5 Prototypical SQL File","heading":"5.8 SELECT","text":"SELECT clause specifies desired columns. can also rename columns perform manipulations. prefer specify aliased table column. two source tables column name, error thrown regarding ambiguity. 
Even ’s concern, believe explicitly specifying source improves readability reduces errors.","code":"SELECT\n  pr.patient_id\n  ,ss.covid_confirmed\n  ,cast(pr.invoice_datetime as date) as problem_date\n  ,pr.code                           as icd10_code"},{"path":"prototype-sql.html","id":"sql-from","chapter":"5 Prototypical SQL File","heading":"5.9 FROM","text":"","code":"FROM cdw.star_1.fact_problem       as pr\n  inner join beasley_covid_1.ss_dx as ss on pr.code = ss.icd10_code"},{"path":"prototype-sql.html","id":"sql-where","chapter":"5 Prototypical SQL File","heading":"5.10 WHERE","text":"clause reduces number returned rows (opposed reducing number columns SELECT clause). Use indention level communicate reader subclauses combined. especially important operators used, since order operations can confused easily.","code":"WHERE\n  pr.problem_date_start between @start_date and @stop_date\n  and\n  pr.patient_id is not null"},{"path":"prototype-sql.html","id":"sql-order-by","chapter":"5 Prototypical SQL File","heading":"5.11 ORDER BY","text":"ORDER clause simply specifies order rows. 
default, column’s values ascending order, can descending desired.","code":"ORDER BY pr.patient_id, pr.problem_date_start desc"},{"path":"prototype-sql.html","id":"sql-indexing","chapter":"5 Prototypical SQL File","heading":"5.12 Indexing","text":"table large queried variety ways, indexing table can speed performance dramatically.","code":"CREATE INDEX ley_covid_1_dx_patient_id on ley_covid_1.dx (patient_id);\nCREATE INDEX ley_covid_1_dx_icd10_code on ley_covid_1.dx (icd10_code);"},{"path":"prototype-repo.html","id":"prototype-repo","chapter":"6 Prototypical Repository","heading":"6 Prototypical Repository","text":"following file repository structure supported wide spectrum projects, ranging (a) small, short-term retrospective project one dataset, one manipulation file, one analysis report (b) large, multi-year project fed dozens input files support multiple statisticians sophisticated enrollment process. Looking beyond single project, strongly encourage team adopt common file organization. Pursuing commonality provides multiple benefits:\nevolved thought-structure makes easier follow good practices avoid common traps.\nCode files portable projects. code can reused environments refer files directories like config.yml, data-public/raw, data-public/derived\nPeople portable projects. person already familiar structure, start contributing quickly already know look statistical reports analysis/ debug problematic file ingestions manipulation/ files.\nspecific project doesn’t use directory file, recommend retaining stub. 
Like empty chunks discusses Prototypical R File chapter, stub communicates collaborator, “project currently doesn’t use feature, /, location”. collaborator can stop search immediately, avoid searching weird places order rule-feature located elsewhere.template worked well us publicly available https://github.com/wibeasley/RAnalysisSkeleton. important files directories described . Please use starting point, dogmatic prison. Make adjustments fits specific project overall team.","code":""},{"path":"prototype-repo.html","id":"repo-root","chapter":"6 Prototypical Repository","heading":"6.1 Root","text":"following files live repository’s root directory, meaning subfolder/subdirectory.","code":""},{"path":"prototype-repo.html","id":"repo-config","chapter":"6 Prototypical Repository","heading":"6.1.1 config.R","text":"configuration file simply plain-text yaml file read config package. well-suited value coordinated across multiple files.Also see discussion use config file excluding bad data values config file relates yaml, json, xml.","code":"default:\n  # To be processed by Ellis lanes\n  path_subject_1_raw:  \"data-public/raw/subject-1.csv\"\n  path_mlm_1_raw:      \"data-public/raw/mlm-1.csv\"\n\n  # Central Database (produced by Ellis lanes).\n  path_database:       \"data-public/derived/db.sqlite3\"\n\n  # Analysis-ready datasets (produced by scribes & consumed by analyses).\n  path_mlm_1_derived:  \"data-public/derived/mlm-1.rds\"\n\n  # Metadata\n  path_annotation:     \"data-public/metadata/cqi-annotation.csv\"\n\n  # Logging errors and messages from automated execution.\n  path_log_flow:       !expr strftime(Sys.time(), \"data-unshared/log/flow-%Y-%m-%d--%H-%M-%S.log\")\n\n  # time_zone_local       :  \"America/Chicago\" # Force local time, in case remotely run.\n\n  # ---- Validation Ranges & Patterns ----\n  range_record_id         : !expr c(1L, 999999L)\n  range_dob               : !expr c(as.Date(\"2010-01-01\"), Sys.Date() + lubridate::days(1))\n  
range_datetime_entry    : !expr c(as.POSIXct(\"2019-01-01\", tz=\"America/Chicago\"), Sys.time())\n  max_age                 : 25\n  pattern_mrn             : \"^E\\\\d{9}$\"  # An 'E', followed by 9 digits."},{"path":"prototype-repo.html","id":"repo-flow","chapter":"6 Prototypical Repository","heading":"6.1.2 flow.R","text":"workflow repo determined flow.R. calls (typically R, Python, SQL) files specific order, sending log messages file.See automation mediators details.","code":""},{"path":"prototype-repo.html","id":"repo-readme","chapter":"6 Prototypical Repository","heading":"6.1.3 README.md","text":"readme automatically displayed GitHub repository opened browser. Include static information can quickly orientate collaborator. Common elements include:Project Name (see style guide naming recommendations)Principal Investigator (ultimately accountable research) Project Coordinator (easy contact questions arise)IRB Tracking Number (whatever oversight committee reviewed approved project). help communicate accurately within larger university company.Abstract project description already written (example, part IRB submission).Documentation locations resources, described documentation/ section belowData Locations resources, \ndatabase database server\nREDCap project id url\nnetworked file share\ndatabase database serverREDCap project id urlnetworked file shareThe PI’s expectations goals analysis teamLikely deadlines, grant conference submission datesEach directory can readme file, (typical analysis projects) discourage putting much individual readme. ’ve found becomes cumbersome keep scattered files updated consistent; ’s also work reader traverse directory structure reading everything. 
approach concentrate information repo’s root readme, remaining readmes static unchanged across projects (e.g., generic description data-public/metadata/).","code":""},{"path":"prototype-repo.html","id":"repo-rproj","chapter":"6 Prototypical Repository","heading":"6.1.4 *.Rproj","text":"Rproj file stores project-wide settings used RStudio IDE, trailing whitespaces handled. file’s major benefit sets R session’s working directory, facilitates good discipline setting constant location files repo. Although plain-text file can edited directly, recommend using RStudio’s dialog box. good documentation Rproj settings. unsure, copy file repo’s root directory rename match repo exactly.","code":""},{"path":"prototype-repo.html","id":"repo-manipulation","chapter":"6 Prototypical Repository","heading":"6.2 manipulation/","text":"","code":""},{"path":"prototype-repo.html","id":"repo-analysis","chapter":"6 Prototypical Repository","heading":"6.3 analysis/","text":"sense, directories exist support contents analysis/. exploratory, descriptive, inferential statistics produced Rmd files. subdirectory name report, (e.g., analysis/report-te-1) within directory four files:R file contains meat analysis (e.g., analysis/report-te-1/report-te-1.R).Rmd file serves “presentation layer” calls R file (e.g., analysis/report-te-1/report-te-1.Rmd).markdown file produced directly Rmd (e.g., analysis/report-te-1/report-te-1.md). people consider intermediate file exists mostly knitr/rmarkdown/pandoc produce eventual html file.html file derived markdown file (e.g., analysis/report-te-1/report-te-1.html). markdown html files can safely discarded reproduced next time Rmd rendered. tables graphs html file self-contained, meaning single file portable emailed without concern directory read . 
Collaborators rarely care manipulation files analysis code; almost always look exclusively outputted html.","code":""},{"path":"prototype-repo.html","id":"repo-data-public","chapter":"6 Prototypical Repository","heading":"6.4 data-public/","text":"directory contain information sensitive proprietary. hold PHI (Protected Health Information), information like participant names, social security numbers, passwords. Files PHI stored GitHub repository, even private GitHub repository. Please see data-unshared/ options storing sensitive information. data-public/ directory typically works best organized subdirectories. commonly use subdirectories, corresponds Data Rest chapter.","code":""},{"path":"prototype-repo.html","id":"data-publicraw","chapter":"6 Prototypical Repository","heading":"6.4.1 data-public/raw/","text":"…input pipelines. datasets usually represents hard work data collection.","code":""},{"path":"prototype-repo.html","id":"data-publicmetadata","chapter":"6 Prototypical Repository","heading":"6.4.2 data-public/metadata/","text":"…definitions datasets raw. example, “gender.csv” might translate values 1 2 male female. Sometimes dataset feels natural either raw metadata subdirectory. file remain unchanged subsequent sample collected, lean towards metadata.","code":""},{"path":"prototype-repo.html","id":"data-publicderived","chapter":"6 Prototypical Repository","heading":"6.4.3 data-public/derived/","text":"…output pipelines. contents completely reproducible starting data-public/raw/ repo’s code. words, can deleted recreated ease. 
might contain small database file, like SQLite.","code":""},{"path":"prototype-repo.html","id":"data-publiclogs","chapter":"6 Prototypical Repository","heading":"6.4.4 data-public/logs/","text":"…logs useful collaborators necessary demonstrate something future, beyond reports contained analysis/ directory.","code":""},{"path":"prototype-repo.html","id":"data-publicoriginal","chapter":"6 Prototypical Repository","heading":"6.4.5 data-public/original/","text":"…nothing (hopefully); ideally never used. similar data-public/raw/. difference data-public/raw/ called pipeline code, data-public/original/ .file data-public/original/ typically comes investigator malformed state requires manual intervention; copied data-public/raw/. Common offenders () csv Excel file bad missing column headers, (b) strange file format readable R package, (c) corrupted file require rehabilitation utility.","code":""},{"path":"prototype-repo.html","id":"characteristics","chapter":"6 Prototypical Repository","heading":"6.4.6 Characteristics","text":"characteristics data-public/ vary based subject matter. instance, medical research projects typically use metadata directory repo, incoming information contains PHI therefore database preferred location. hand, microbiology physics research typically data protected law, desirable repo contain everything ’s unnecessarily spread .feel private GitHub repo offers adequate protection scooped biggest risk.","code":""},{"path":"prototype-repo.html","id":"repo-data-unshared","chapter":"6 Prototypical Repository","heading":"6.5 data-unshared/","text":"Files directory stored local computer, committed sent central GitHub repository/server. makes folder candidate :sensitive information, PHI (Protected Health Information). PHI involved, recommend data-unshared/ database secured networked file share feasible. See discussion .sensitive information, PHI (Protected Health Information). 
PHI involved, recommend data-unshared/ database secured networked file share feasible. See discussion .huge public files say, files 1+ GB easily downloadable reproducible. instance, files stable sources like US Census, Bureau Labor Statistics, dataverse.org.huge public files say, files 1+ GB easily downloadable reproducible. instance, files stable sources like US Census, Bureau Labor Statistics, dataverse.org.diagnostic logs useful collaborators.diagnostic logs useful collaborators.line repo’s .gitignore file blocks directory’s contents staged/committed (look /data-unshared/*). Since files directory committed, requires discipline communicate files collaborator’s computer. List files either repo’s readme data-unshared/contents.md; minimum declare name file can downloaded reproduced. (curious, !data-unshared/contents.md line .gitignore declares exception markdown file committed updated collaborator’s machine.)Even though files kept central repository, recommend encrypting local drive data-unshared/ contains sensitive data (PHI). See data-public/ README.md information.directory works best subdirectories described organization data-public/.Compared data-unshared/, prefer storing PHI enterprise database (SQL Server, PostgreSQL, MariaDB/MySQL, Oracle) networked drive four reasons.central resources typically managed Campus reviewed security professionals.’s trivial stay synchronized across collaborators file share database. contrast, data-unshared/ isn’t synchronized across machines extra discipline required tell collaborators update machines.’s sometimes possible recover lost data file share database. ’s much less likely turn back clock data-unshared/ files.’s unlikely mess .gitignore entries allow sensitive files committed repository. 
sensitive information stored data-unshared/, important review every commit ensure information isn’t sneak repo.","code":""},{"path":"prototype-repo.html","id":"repo-documentation","chapter":"6 Prototypical Repository","heading":"6.6 documentation/","text":"Good documentation scarce documentation files consume little space, liberally copy everything get directory. helpful include:Approval letters IRB oversight board. especially important also gatekeeper database, must justify releasing sensitive information.Data dictionaries incoming datasets team ingesting.Data dictionaries derived datasets team producing.documentation public stable, like CDC’s site vaccination codes, include url repo’s readme. feel information location may change, copy url also full document easier reconstruct logic returning project years.","code":""},{"path":"prototype-repo.html","id":"repo-optional","chapter":"6 Prototypical Repository","heading":"6.7 Optional","text":"Everything mentioned now exist repo, even file directory empty. projects benefit following additional capabilities.","code":""},{"path":"prototype-repo.html","id":"repo-description","chapter":"6 Prototypical Repository","heading":"6.7.1 DESCRIPTION","text":"plain-text DESCRIPTION file lives repo’s root directory –see example R Analysis Skeleton. file allows repo become R package, provides following benefits even never deployed CRAN.specify packages (versions) required code. include packages aren’t available CRAN, like OuhscBbmc/OuhscMunge.better unify test common code called multiple files.better document functions datasets within repo.last two bullets essentially upgrade merely sticking code file sourcing .package offers many capabilities beyond listed , typical data science repo scratch surface. 
larger topic covered Hadley Wickham’s R Packages.","code":""},{"path":"prototype-repo.html","id":"repo-utility","chapter":"6 Prototypical Repository","heading":"6.7.2 utility/","text":"Include files may run occasionally, required reproduce analyses. Examples include:code submitting entire repo pipeline super computer,simulate artificial demonstration data, orrunning diagnostic checks code using something like goodpractice urlchecker.","code":""},{"path":"prototype-repo.html","id":"repo-stitched","chapter":"6 Prototypical Repository","heading":"6.7.3 stitched-output/","text":"Stitching light-weight capability knitr/rmarkdown. stitch repo’s files (server type logging), consider directing output directory. basic call :don’t use approach medical research, sensitive information usually contained output, sensitive patient information stored repo. (’s last time ’ll talk sensitive information –least chapter.)","code":"\nknitr::stitch_rmd(\n  script = \"manipulation/car-ellis.R\",\n  output = \"stitched-output/manipulation/car-ellis.md\"\n)"},{"path":"rest.html","id":"rest","chapter":"7 Data at Rest","heading":"7 Data at Rest","text":"","code":""},{"path":"rest.html","id":"rest-states","chapter":"7 Data at Rest","heading":"7.1 Data States","text":"extension data-public/ discussion. chapter theoretical applies forms data, just files prototypical repo.easiest demarcate data two states: raw derived. Raw data represents input pipelines. Sometimes junk. usually files cherished culmination hard work data collection. Derived data represents output pipelines. contents completely reproducible starting raw data repo’s code. words, derived information can deleted recreated ease.terminology, original data file directly received collaborator. good day, “original” “raw” synonymous. Meaning files received ingestible directly pipeline. However sometimes collaborator provides malformed data file requires manual intervention. rehabilitated, becomes raw data. 
Common offenders () csv Excel file bad missing column headers, (b) strange file format readable R package, (c) corrupted file require rehabilitation utility.original file isn’t perfect, ’ll decide blemishes can programmatically fixed, blemishes manually fixed. triage process, sometimes difficult determine worth investing time fix code. everything can fixed code, original raw data equivalent (“original” state can ignored).heuristics help decide address manually programmatically.Arguments Programmatic Fixes:original data frequently refreshed. pipeline ingests new files every day, ’s probably worth investment fix.original data frequently refreshed. pipeline ingests new files every day, ’s probably worth investment fix.*code *code wouldArguments Manual Fixes:corrections subjective. Sometimes desired fix follow deterministic rules. scenarios, see “Return file collaborator” alternative.corrections subjective. Sometimes desired fix follow deterministic rules. scenarios, see “Return file collaborator” alternative.’s quick fix one-time dataset.’s quick fix one-time dataset.Alternatives:Return file collaborator. Especially grad students interns available. One justification ’re usually experts field, . better equipped evaluate data point context determine correct correction. second justification company/university probably doesn’t want pay statisticians data scientists clean upIf corporate consultant, propose team willing fix data points provide estimated cost training personnel correctly evaluate context client can offload task.Separate excise manual step. majority file can ingested without manual intervention, try split task two. Consider patient’s visit record hospital database. information well-structured easily transformed discrete cells. However “visit notes” written nurses physician . Sometimes notes areSeparate excise manual step. majority file can ingested without manual intervention, try split task two. Consider patient’s visit record hospital database. 
information well-structured easily transformed discrete cells. However “visit notes” written nurses physician . Sometimes notes areRawRawDerived\nProject-wide File Repo\nProject-wide File Protected File Server\nUser-specific File Protected File Server\nProject-wide Database\nDerivedProject-wide File RepoProject-wide File Protected File ServerUser-specific File Protected File ServerProject-wide DatabaseOriginalOriginal","code":""},{"path":"rest.html","id":"data-containers","chapter":"7 Data at Rest","heading":"7.2 Data Containers","text":"","code":""},{"path":"rest.html","id":"rest-containers-csv","chapter":"7 Data at Rest","heading":"7.2.1 csv","text":"exchanging data two different systems, preferred format frequently plain text, cell record separated comma. commonly called csv –comma separated value file. opposed proprietary formats like xlsx sas7bdat, csv file easily opened parsable statistical software, even conventional text editors GitHub.","code":""},{"path":"rest.html","id":"rest-containers-rds","chapter":"7 Data at Rest","heading":"7.2.2 rds","text":"","code":""},{"path":"rest.html","id":"rest-containers-yaml","chapter":"7 Data at Rest","heading":"7.2.3 yaml, json, and xml","text":"yaml, json, xml three plain-text hierarchical formats commonly used data structure naturally represented rectangle set rectangles (therefore good fit csv rds). unsure start nested dataset, see tidyr’s Rectangling vignette. way advocate simplest recoding function adequate task, prefer yaml json, json xml. Yaml accommodates , needs. Initially may tricky correctly use whitespacing specify correct nesting structure yaml, familiar, file easy read edit, Git diffs can quickly reviewed. yaml package reads yaml file, returns (nested) R list; can also convert R list yaml file. config package wraps yaml package fill common need: retrieving repository configuration information yaml file. recommend using config package fits. ways functionality simplification yaml package, extension ways. 
example, value follows !expr, R evaluate expression. commonly specify allowable ranges variables config.ymlSee discussion config.yml prototypical repository, well.","code":"range_dob: !expr c(as.Date(\"2010-01-01\"), Sys.Date() + lubridate::days(1))"},{"path":"rest.html","id":"rest-containers-arrow","chapter":"7 Data at Rest","heading":"7.2.4 Arrow","text":"Apache Arrow open source specification developed work many languages R, Spark, Python, many others. accommodates nice rectangles CSVs used, hierarchical nesting json xml used.-memory specification (allows Python process directly access R object), -disk specification (allows Python process read saved R file). file format compressed, takes much less space store disk less time transfer network.downside file plain-text, binary. means file readable editable many programs, hurts project’s portability. wouldn’t want store metadata files arrow collaborators couldn’t easily help map values qqq","code":""},{"path":"rest.html","id":"rest-containers-sqlite","chapter":"7 Data at Rest","heading":"7.2.5 SQLite","text":"","code":""},{"path":"rest.html","id":"rest-containers-database","chapter":"7 Data at Rest","heading":"7.2.6 Central Enterprise database","text":"","code":""},{"path":"rest.html","id":"rest-containers-redcap","chapter":"7 Data at Rest","heading":"7.2.7 Central REDCap database","text":"","code":""},{"path":"rest.html","id":"rest-containers-avoid","chapter":"7 Data at Rest","heading":"7.2.8 Containers to avoid","text":"","code":""},{"path":"rest.html","id":"rest-containers-avoid-spreadsheets","chapter":"7 Data at Rest","heading":"7.2.8.1 Spreadsheets","text":"Try receive data Excel files. think Excel can useful light brainstorming prototyping equations –trusted transport serious information. spreadsheet software like LibreOffice Calc less problematic experience, still less desirable formats mentioned .receive csv open typical spreadsheet program, strongly recommend save , potential mangling values. 
close spreadsheet, review Git commits verify values corrupted. See appendix list ways analyses can undermined receiving Excel files, well template correspond less-experienced colleagues sending team Excel files.","code":""},{"path":"rest.html","id":"rest-containers-avoid-proprietary","chapter":"7 Data at Rest","heading":"7.2.8.2 Proprietary","text":"Proprietary formats like SAS’s “sas7bdat” less accessible people without current expensive software licenses. Therefore distributing proprietary file formats hurts reproducibility decreases project’s impact. hand, using proprietary formats may advantageous need conceal project’s failure. formerly distributed sas7bdat files supplement (otherwise identical) csvs, order cater surprisingly large population SAS users unfamiliar proc import Google search engine. Recently distributed csvs, example code reading file SAS.","code":""},{"path":"rest.html","id":"data-conventions","chapter":"7 Data at Rest","heading":"7.3 Storage Conventions","text":"","code":""},{"path":"rest.html","id":"rest-conventions-all","chapter":"7 Data at Rest","heading":"7.3.1 All Sources","text":"Across file formats, conventions usually work best. consistency across versions: use script produce dataset, inform recipient dataset’s structure changes. processes automated, changes trivial humans (e.g., yyyy-mm-dd mm/dd-yy) break automation.\nspecificity automation intentional. install guards processes bad values pass. 
instance, may place bounds toddlers’ age 12 36 months. want automation break next dataset contains age values 1 3 (years). downstream analysis (say, regression model age predictor variable) produce misleading results shift months years went undetected. date format: specify YYYY-MM-DD (ISO-8601) time format: specify HH:MM HH:MM:SS, preferably 24-hour time. Use leading zero midnight 9:59am, colon separating hours, minutes, seconds (.e., 09:59) patient names: separate name_last, name_first, name_middle three distinct variables possible. currency: represent money integer floating-point variable. representation easily parsable software, enables mathematical operations (like max() mean()) performed directly. Avoid commas symbols like “$”. possibility ambiguity, indicate denomination variable name (e.g., payment_dollars payment_euros).","code":""},{"path":"rest.html","id":"rest-conventions-text","chapter":"7 Data at Rest","heading":"7.3.2 Text","text":"conventions usually work best within plain-text formats. csv: comma separated values common plain-text format, better support similar formats cells separated tabs semi-colons. 
However, receiving well-behaved file separated characters, thankful go flow. cells enclosed quotes: ‘cell’ enclosed double quotes, especially ’s string/character variable.","code":""},{"path":"rest.html","id":"rest-conventions-excel","chapter":"7 Data at Rest","heading":"7.3.3 Excel","text":"discussed avoid Excel. possible, conventions helps reduce ambiguity corrupted values. See appendix preferred approach reading Excel files. avoid multiple tabs/worksheets: Excel files containing multiple worksheets complicated read automation, produces opportunities inconsistent variables across tabs/worksheets. save cells text: avoiding Excel attempting save cells dates numbers. Admittedly, last-ditch effort. someone using Excel convert cells text, values probably already corrupted.","code":""},{"path":"rest.html","id":"rest-conventions-meditech","chapter":"7 Data at Rest","heading":"7.3.4 Meditech","text":"patient identifier: mrn_meditech instead mrn, MRN Rec#, Med Rec#. account/admission identifier: account_number instead mrn, Acct#, Account#. patient’s full name: name_full instead Patient Name Name. long/tall format: one row per dx per patient (50 dxs) instead 50 columns dx per patient. 
Applies to\ndiagnosis code & description\norder date & number\nprocedure name & number\nMeditech Idiosyncrasies: blood pressure: systems bp_diastolic bp_systolic values stored separate integer variables. Meditech, stored single character variable, separated forward slash.","code":""},{"path":"rest.html","id":"rest-conventions-database","chapter":"7 Data at Rest","heading":"7.3.5 Databases","text":"exchanging data two different systems, …","code":""},{"path":"patterns.html","id":"patterns","chapter":"8 Patterns","heading":"8 Patterns","text":"","code":""},{"path":"patterns.html","id":"pattern-ellis","chapter":"8 Patterns","heading":"8.1 Ellis","text":"","code":""},{"path":"patterns.html","id":"purpose","chapter":"8 Patterns","heading":"8.1.1 Purpose","text":"incorporate outside data source system safely.","code":""},{"path":"patterns.html","id":"philosophy","chapter":"8 Patterns","heading":"8.1.2 Philosophy","text":"Without data immigration, warehouses useless. Embrace power fresh information way :\nrepeatable data source updated (refresh warehouse)\nsimilar Ellis lanes (designed data sources) don’t learn/remember entirely new pattern. (Like Rubiks cube instructions.)","code":""},{"path":"patterns.html","id":"guidelines","chapter":"8 Patterns","heading":"8.1.3 Guidelines","text":"Take small bites.\nLike software development, don’t tackle complexity first time. Start processing important columns incorporating move.\nUse variables need short-term, especially new projects. everyone knows, variables upstream source can change. 
Don’t spend effort writing code variables won’t need months/years; ’ll likely change need .\nrow passes verify-values chunk, ’re accountable failures causes warehouse. analysts know external data messy, don’t surprised. Sometimes ’ll spend hour writing Ellis 6 columns.\nTake small bites.Like software development, don’t tackle complexity first time. Start processing important columns incorporating move.Use variables need short-term, especially new projects. everyone knows, variables upstream source can change. Don’t spend effort writing code variables won’t need months/years; ’ll likely change need .row passes verify-values chunk, ’re accountable failures causes warehouse. analysts know external data messy, don’t surprised. Sometimes ’ll spend hour writing Ellis 6 columns.Narrowly define Ellis lane. One code file strive () consume one CSV (b) produce one table. Exceptions include:\nmultiple input files related, really belong together (e.g., one CSV per month, one CSV per clinic). scenario pretty common.\nCSV legitimately produce two different tables munging. happens infrequently, one warehouse table needs wide, another long.\nNarrowly define Ellis lane. One code file strive () consume one CSV (b) produce one table. Exceptions include:multiple input files related, really belong together (e.g., one CSV per month, one CSV per clinic). scenario pretty common.CSV legitimately produce two different tables munging. 
happens infrequently, one warehouse table needs wide, another long.","code":""},{"path":"patterns.html","id":"examples","chapter":"8 Patterns","heading":"8.1.4 Examples","text":"https://github.com/wibeasley/RAnalysisSkeleton/blob/main/manipulation/te-ellis.R\nhttps://github.com/wibeasley/RAnalysisSkeleton/blob/main/manipulation/\nhttps://github.com/OuhscBbmc/usnavy-billets/blob/main/manipulation/survey-ellis.R","code":""},{"path":"patterns.html","id":"elements","chapter":"8 Patterns","heading":"8.1.5 Elements","text":"Clear memory scripting languages like R (unlike compiled languages like Java), ’s easy old variables hang around. Explicitly clear run file .\n\nrm(list = ls(all = TRUE)) # Clear memory variables previous run. called knitr, first chunk.\nLoad Sources R, source()d file run execute code. prefer sourced file load variables (like function definitions), instead real operations like read dataset perform calculation. many times want function available multiple files repo; two approaches like. first collecting common functions single file (sourcing callers). second make repo legitimate R package.\nfirst approach better suited quick & easy development. 
second allows add documentation unit tests.\n\n# ---- load-sources ------------------------------------------------------------\nsource(\"./manipulation/osdh/ellis/common-ellis.R\")\nLoad Packages another precaution necessary scripting language. Determine necessary packages available machine. Avoiding attaching packages (library() function) possible. functions don’t need qualified (e.g., dplyr::intersect()) cause naming conflicts. Even can guarantee don’t conflict packages now, packages add new functions future conflict.\n\n# ---- load-packages -----------------------------------------------------------\n# Attach package(s) functions need qualified: http://r-pkgs.had.co.nz/namespace.html#search-path\nlibrary(magrittr            , quietly=TRUE)\nlibrary(DBI                 , quietly=TRUE)\n\n# Verify packages available machine, functions need qualified: http://r-pkgs.had.co.nz/namespace.html#search-path\nrequireNamespace(\"readr\"        )\nrequireNamespace(\"tidyr\"        )\nrequireNamespace(\"dplyr\"        ) # Avoid attaching dplyr, b/c function names conflict lot packages (esp base, stats, plyr).\nrequireNamespace(\"testit\")\nrequireNamespace(\"checkmate\")\nrequireNamespace(\"OuhscMunge\") # remotes::install_github(repo=\"OuhscBbmc/OuhscMunge\")\nDeclare Global Variables Functions. 
includes defining expected column names types data sources; use readr::cols_only() (opposed readr::cols()) ignore new columns may added since dataset’s last refresh.\n\n# ---- declare-globals ---------------------------------------------------------\nLoad Data Source(s) See load-data chunk described prototypical file.\n\n# ---- load-data ---------------------------------------------------------------\nTweak Data\nSee tweak-data chunk described prototypical file.\n\n# ---- tweak-data --------------------------------------------------------------\nBody Ellis\nVerify\nSpecify Columns\nSee specify-columns-to-upload chunk described prototypical file.\n\n# ---- specify-columns-to-upload -----------------------------------------------\nWelcome warehouse. chunk, nothing persisted.\n\n# ---- save-to-db --------------------------------------------------------------\n# ---- save-to-disk ------------------------------------------------------------","code":"\nrm(list = ls(all = TRUE)) # Clear the memory of variables from previous run. 
This is not called by knitr, because it's above the first chunk.\n# ---- load-sources ------------------------------------------------------------\nsource(\"./manipulation/osdh/ellis/common-ellis.R\")\n# ---- load-packages -----------------------------------------------------------\n# Attach these package(s) so their functions don't need to be qualified: http://r-pkgs.had.co.nz/namespace.html#search-path\nlibrary(magrittr            , quietly=TRUE)\nlibrary(DBI                 , quietly=TRUE)\n\n# Verify these packages are available on the machine, but their functions need to be qualified: http://r-pkgs.had.co.nz/namespace.html#search-path\nrequireNamespace(\"readr\"        )\nrequireNamespace(\"tidyr\"        )\nrequireNamespace(\"dplyr\"        ) # Avoid attaching dplyr, b/c its function names conflict with a lot of packages (esp base, stats, and plyr).\nrequireNamespace(\"testit\")\nrequireNamespace(\"checkmate\")\nrequireNamespace(\"OuhscMunge\") # remotes::install_github(repo=\"OuhscBbmc/OuhscMunge\")\n# ---- declare-globals ---------------------------------------------------------\n# ---- load-data ---------------------------------------------------------------\n# ---- tweak-data --------------------------------------------------------------\n# ---- specify-columns-to-upload -----------------------------------------------\n# ---- save-to-db --------------------------------------------------------------\n# ---- save-to-disk ------------------------------------------------------------"},{"path":"patterns.html","id":"pattern-arch","chapter":"8 Patterns","heading":"8.2 Arch","text":"","code":""},{"path":"patterns.html","id":"pattern-ferry","chapter":"8 Patterns","heading":"8.3 Ferry","text":"","code":""},{"path":"patterns.html","id":"pattern-scribe","chapter":"8 Patterns","heading":"8.4 Scribe","text":"","code":""},{"path":"patterns.html","id":"pattern-analysis","chapter":"8 Patterns","heading":"8.5 
Analysis","text":"","code":""},{"path":"patterns.html","id":"pattern-presentation-static","chapter":"8 Patterns","heading":"8.6 Presentation -Static","text":"","code":""},{"path":"patterns.html","id":"pattern-presentation-interactive","chapter":"8 Patterns","heading":"8.7 Presentation -Interactive","text":"","code":""},{"path":"patterns.html","id":"pattern-metadata","chapter":"8 Patterns","heading":"8.8 Metadata","text":"Survey items can change across time (justified unjustified reasons). prefer dedicate metadata csv single variablehttps://github.com/LiveOak/vasquez-mexican-census-1/issues/17#issuecomment-567254695","code":""},{"path":"patterns.html","id":"primary-rules-for-mapping","chapter":"8 Patterns","heading":"8.8.1 Primary Rules for Mapping","text":"important rules necessary map concepts multidimensional space.variable gets csv, relationship.csv (show ), education.csv, living-status.csv, race.csv. ’s easiest file name matches variable.variable gets csv, relationship.csv (show ), education.csv, living-status.csv, race.csv. ’s easiest file name matches variable.variable also needs unique integer identifies underlying level database, education_id, living_status_id, relationship_id.variable also needs unique integer identifies underlying level database, education_id, living_status_id, relationship_id.survey wave gets column within csv, code_2011 code_2016.survey wave gets column within csv, code_2011 code_2016.level within variable-wave gets row, like Jefe, Esposo, Hijo.level within variable-wave gets row, like Jefe, Esposo, Hijo.","code":""},{"path":"patterns.html","id":"secondary-rules-for-mapping","chapter":"8 Patterns","heading":"8.8.2 Secondary Rules for Mapping","text":"scenarios, first three columns critical (.e., relationship_id, code_2011, code_2016). Yet additional guidelines help plumbing manipulation lookup variables.variable also needs unique name identifies underlying level human, education, living_status, relationship. 
human label corresponding relationship_id. ’s easiest column name matches variable.variable also needs unique name identifies underlying level human, education, living_status, relationship. human label corresponding relationship_id. ’s easiest column name matches variable.survey wave gets column within csv, description_2011 description_2016. human labels corresponding variables like code_2011 code_2016.survey wave gets column within csv, description_2011 description_2016. human labels corresponding variables like code_2011 code_2016.variable benefits unique display order value, used later analyses. Categorical variables typically desired sequence graph legends tables; specify order . helps define factor levels R pandas.Categorical levels Python.variable benefits unique display order value, used later analyses. Categorical variables typically desired sequence graph legends tables; specify order . helps define factor levels R pandas.Categorical levels Python.Mappings usually informed outside documentation. transparency maintainability, clearly describe documentation can found. One option include data-public/metadata/README.md. Another option include bottom csv, preceded #, ‘comment’ character can keep csv-parser treating notes like data needs squeeze cells. Notes example :\n# Notes,,,,,,\n# 2016 codes come `documentation/2106/fd_endireh2016_dbf.pdf`, pages 14-15,,,,,\n# 2011 codes come `documentation/2011/fd_endireh11.xls`, ‘TSDem’ tab,,,,,Mappings usually informed outside documentation. transparency maintainability, clearly describe documentation can found. One option include data-public/metadata/README.md. Another option include bottom csv, preceded #, ‘comment’ character can keep csv-parser treating notes like data needs squeeze cells. Notes example :sometimes notes column helps humans keep things straight, especially researchers new field/project. 
example , notes value first row might “jefe means ‘head’, ‘boss’”.sometimes notes column helps humans keep things straight, especially researchers new field/project. example , notes value first row might “jefe means ‘head’, ‘boss’”.","code":"# Notes,,,,,,\n# 2016 codes come from `documentation/2106/fd_endireh2016_dbf.pdf`, pages 14-15,,,,,\n# 2011 codes come from `documentation/2011/fd_endireh11.xls`, ‘TSDem’ tab,,,,,"},{"path":"security.html","id":"security","chapter":"9 Security & Private Data","heading":"9 Security & Private Data","text":"Overview{Include paragraphs describe principles mentality, following sections contribute.}report’s dataset(s) preferably stored REDCap SQL Server.\n’re absolutely stored GitHub local machine.\nAvoid Microsoft Access, Excel, CSVs, anything without user accounts.\nPHI must stored loose file (eg, CSV), keep encrypted file server.\nPHI fileserver stored directory controlled fairly restrictive Windows AD group. ~4 people project probably need access files, ~20 people project.\nmany benefits SQL Server CSVs Excel files .\n’s protected Odyssey (just VPN).\nprovides auditing logs.\nprovides schemas partition authorization.\nReal databases aren’t accidentally emailed copied unsecured location.\nTransfer PHI REDCap & SQL Server early possible (particularly CSVs & XLSXs regularly receive partners).\nTemporary derivative datasets stored SQL Server, CSV fileserver.","code":""},{"path":"security.html","id":"security-guidelines","chapter":"9 Security & Private Data","heading":"9.1 Security Guidelines","text":"encounter decision ’s described chapter’s security practices, follow underlying concepts. 
course, consult people.Principle least privilege: expose little possible.\nLimit number team members.\nLimit amount data (consider rows & columns).\nObfuscate values remove unnecessary PHI derivative datasets.\nLimit number team members.Limit amount data (consider rows & columns).Obfuscate values remove unnecessary PHI derivative datasets.Redundant layers protection.\nsingle point failure shouldn’t enough breach PHI security.\nsingle point failure shouldn’t enough breach PHI security.Simplicity possible.\nStore data two houses (eg, REDCap & SQL Server).\nEasier identify & manage bunch PHI CSVs scattered across dozen folders, versions.\nManipulate data programmatically, manually.\n\nWindows AD account controls everything, indirectly directly:\nVPN, Odyssey, file server, SQL, REDCap, & REDCap API.\n\nStore data two houses (eg, REDCap & SQL Server).Easier identify & manage bunch PHI CSVs scattered across dozen folders, versions.\nManipulate data programmatically, manually.\nManipulate data programmatically, manually.Windows AD account controls everything, indirectly directly:\nVPN, Odyssey, file server, SQL, REDCap, & REDCap API.\nVPN, Odyssey, file server, SQL, REDCap, & REDCap API.Lock team members possible.\n’s don’t trust lot unnecessary data, ’s don’t trust ex-boyfriends coffee shop hackers.\n’s don’t trust lot unnecessary data, ’s don’t trust ex-boyfriends coffee shop hackers.","code":""},{"path":"security.html","id":"dataset-level-redaction","chapter":"9 Security & Private Data","heading":"9.2 Dataset-level Redaction","text":"Several multi-layered strategies exist prevent exposing PHI. One approach simply reduce information contained variable. Much information medical record useful modeling descriptive statistics, therefore can omitted downstream datasets. 
techniques include:Remove variable: empty bucket nothing leak.Decrease resolution: Many times, patient’s year birth adequate analysis, include month day unnecessary risks.Hash salt identifiers: use cryptographic-quality algorithms transform ID derived value. example, “234” becomes “1432c1a399”. original value 234 recoverable 1432c1a399. two rows 1432c1a399 still attributed patient statistical model.","code":""},{"path":"security.html","id":"security-for-data-at-rest","chapter":"9 Security & Private Data","heading":"9.3 Security for Data at Rest","text":"report’s dataset(s) preferably stored REDCap SQL Server.\n’re absolutely stored GitHub local machine.\nAvoid Microsoft Access, Excel, CSVs, anything without user accounts.\nPHI must stored loose file (eg, CSV), keep encrypted file server.\n’re absolutely stored GitHub local machine.Avoid Microsoft Access, Excel, CSVs, anything without user accounts.PHI must stored loose file (eg, CSV), keep encrypted file server.PHI fileserver stored directory controlled fairly restrictive Windows AD group. ~4 people project probably need access files, ~20 people project.many benefits SQL Server CSVs Excel files .\n’s protected Odyssey (just VPN).\nprovides auditing logs.\nprovides schemas partition authorization.\nReal databases aren’t accidentally emailed copied unsecured location.\n’s protected Odyssey (just VPN).provides auditing logs.provides schemas partition authorization.Real databases aren’t accidentally emailed copied unsecured location.Transfer PHI REDCap & SQL Server early possible (particularly CSVs & XLSXs regularly receive partners).Temporary derivative datasets stored SQL Server, CSV fileserver.Hash values possible. instance, determine families/networks people, use things like SSNs. algorithm identifies clusters doesn’t need know actual SSN, just two records SSN. Something like SHA-256 hash good . algorithm can operate hashed SSN just effectively real SSN. However original SSN can’t determined hashed value. 
table accidentally exposed public, no PHI compromised. following two files help hashing & salting process: HashUtility.R CreateSalt.R.","code":""},{"path":"security.html","id":"file-level-permissions","chapter":"9 Security & Private Data","heading":"9.4 File-level permissions","text":"","code":""},{"path":"security.html","id":"database-permissions","chapter":"9 Security & Private Data","heading":"9.5 Database permissions","text":"","code":""},{"path":"security.html","id":"public-private-repositories","chapter":"9 Security & Private Data","heading":"9.6 Public & Private Repositories","text":"","code":""},{"path":"security.html","id":"repo-rules","chapter":"9 Security & Private Data","heading":"9.6.1 Repo Rules","text":"code repository private, restricted necessary project members.\nrepo controlled OUHSC organization, individual’s private account.\n.gitignore file prohibits common data file formats pushed/uploaded central repository.\nExamples: accdb, mdb, xlsx, csv, sas7bdat, rdata, RHistory.\ntext file without PHI must GitHub, create new extension like ‘*.PhiFree’.\ncan include specific exception .gitignore file, adding exclamation point front file, !RecruitmentProductivity/RecruitingZones/ZipcodesToZone.csv. example included current repository’s .gitignore file (https://github.com/OuhscBbmc/RedcapExamplesAndPatterns/blob/main/.gitignore).","code":""},{"path":"security.html","id":"scrubbing-github-history","chapter":"9 Security & Private Data","heading":"9.6.2 Scrubbing GitHub history","text":"Occasionally files may committed git repository need removed completely. 
just current collections files (i.e., branch’s head), entire history repo.\nScrubbing require typically (a) sensitive file accidentally committed pushed GitHub, (b) huge file bloated repository disrupted productivity.\ntwo suitable scrubbing approaches require command line. first git-filter-branch command within git, second BFG repo-cleaner. use second approach, recommended GitHub; requires 15 minutes install configure scratch, much easier develop , executes much faster.\nbash-centric steps remove files repo history called ‘monster-data.csv’ ‘bloated’ repository.\nfile contains passwords, change immediately.\nDelete ‘monster-data.csv’ branch push commit GitHub.\nAsk collaborators push outstanding commits GitHub delete local copy repo. scrubbing complete, re-clone .\nDownload install recent Java JRE Oracle site.\nDownload recent jar file BFG site home directory.\nClone fresh copy repository user’s home directory. 
--mirror argument avoids downloading every file, downloads bookkeeping details required scrubbing.\ncd ~\ngit clone --mirror https://github.com/your-org/bloated.git\nRemove files (directory) called ‘monster-data.csv’.\njava -jar bfg-*.jar --delete-files monster-data.csv bloated.git\nReflog garbage collect repo.\ncd bloated.git\ngit reflog expire --expire=now --all && git gc --prune=now --aggressive\nPush local changes GitHub server.\ngit push\nDelete bfg jar home directory.\ncd ~\nrm bfg-*.jar\nAsk collaborators re-clone repo local machine. important restart fresh copy, -scrubbed file reintroduced repo’s history.\nfile contains sensitive information, like passwords PHI, ask GitHub support refresh cache file’s history isn’t accessible website, even repo private.\nGitHub provides chatbot helps submit request. 
time writing, go https://support.github.com/request?tags=docs-generic&q=remove+cached+views click “Clear cached views Virtual Agent” blue button.","code":"cd ~\ngit clone --mirror https://github.com/your-org/bloated.git\njava -jar bfg-*.jar --delete-files monster-data.csv bloated.git\ncd bloated.git\ngit reflog expire --expire=now --all && git gc --prune=now --aggressive\ngit push\ncd ~\nrm bfg-*.jar"},{"path":"security.html","id":"resources","chapter":"9 Security & Private Data","heading":"9.6.2.0.1 Resources","text":"BFG Repo-Cleaner site\nAdditional BFG instructions\nGitHub Sensitive Data Removal Policy\nGitHub Removing sensitive data repository","code":""},{"path":"automation.html","id":"automation","chapter":"10 Automation & Reproducibility","heading":"10 Automation & Reproducibility","text":"Automation important prerequisite reproducibility.","code":""},{"path":"automation.html","id":"automation-mediator","chapter":"10 Automation & Reproducibility","heading":"10.1 Mediator","text":"nontrivial project usually multiple stages pipeline. Instead human deciding execute piece, single file execute pieces. 
single file makes project portable, also clearly documents process.single file special cases mediator pattern, sense defines piece relates .","code":""},{"path":"automation.html","id":"automation-flow","chapter":"10 Automation & Reproducibility","heading":"10.1.1 Flow File in R","text":"{Describe https://github.com/wibeasley/RAnalysisSkeleton/blob/main/flow.R.}See also prototypical repo.","code":""},{"path":"automation.html","id":"automation-makefile","chapter":"10 Automation & Reproducibility","heading":"10.1.2 Makefile","text":"{Briefly describe language, can efficient, additional obstacles presents.}","code":""},{"path":"automation.html","id":"automation-ssis","chapter":"10 Automation & Reproducibility","heading":"10.1.3 SSIS","text":"{Describe SSIS package development.}","code":""},{"path":"automation.html","id":"automation-scheduling","chapter":"10 Automation & Reproducibility","heading":"10.2 Scheduling","text":"","code":""},{"path":"automation.html","id":"automation-cron","chapter":"10 Automation & Reproducibility","heading":"10.2.1 cron","text":"cron common choice scheduling tasks Linux. plain text file specifies file run, recurring schedule. lot helpful documentation tutorials exists, well sites help construct validate entries like crontab guru.","code":""},{"path":"automation.html","id":"automation-task-scheduler","chapter":"10 Automation & Reproducibility","heading":"10.2.2 Task Scheduler","text":"Windows Task Scheduler common choice scheduling tasks Windows.Many GUI options easy specify, three error-prone, must specified carefully. exist “Actions” | “Start program”.Program/script: absolute path Rscript.exe. needs updated every time upgrade R (unless ’re something tricky PATH environmental OS variable). Notice using “patched” version R. entry enclosed quotes.\n\"C:\\Program Files\\R\\R-4.1.1patched\\bin\\Rscript.exe\"Program/script: absolute path Rscript.exe. needs updated every time upgrade R (unless ’re something tricky PATH environmental OS variable). 
Notice using “patched” version R. entry enclosed quotes.Add arguments (optional): specifies flow file run. case, repo ‘butcher-hearing-screen-1’ ’Documents/cdw/` directory; flow file located repo’s root directory, discussed prototypical repo. entry enclosed quotes.\n\"C:\\Users\\wbeasley\\Documents\\cdw\\butcher-hearing-screen-1\\flow.R\"Add arguments (optional): specifies flow file run. case, repo ‘butcher-hearing-screen-1’ ’Documents/cdw/` directory; flow file located repo’s root directory, discussed prototypical repo. entry enclosed quotes.Start (optional): sets working directory. properly set, relative paths files point correct locations. identical entry , () include ‘/flow.R’ (b) contains quotes.\nC:\\Users\\wbeasley\\Documents\\cdw\\butcher-hearing-screen-1Start (optional): sets working directory. properly set, relative paths files point correct locations. identical entry , () include ‘/flow.R’ (b) contains quotes.options typically specify :\nSelect “Run whether user logged .”\n\nConfigure highest available version Windows, using dropdown box.\n\n“Wake computer run task” probably necessary located normal desktop. something specify, tasks located VM-based workstation never turned .\nFollowing instructions, required enter password every time modify task, every time update password. using network credentials, probably specify account like “domain/username”. careful: modify task prompted password, GUI may subtly alter account entry just “username” (instead “domain”). Make sure prepend username domain, enter password.10+ tasks, consider creating System Environment Variable called %rscript_path% whose value something like \"C:\\Program Files\\R\\R-4.1.1patched\\bin\\Rscript.exe\". text %rscript_path% goes step one (“Program/script” ). R updated every months, need change path one place (.e., Environment Variables GUI) instead task, requires repeatedly re-entering username password. 
defined tasks differently describe , may need restart machine load fresh variable value Task Scheduler environment.code executed task scheduler accesses network drive file share, path naturally reference mapped letter. easiest solution spell full path. instance Python/R code, replace “Q:/subdirectory/hospital-location.csv” “//server-name/data-files/subdirectory/hospital-location.csv”.","code":"\"C:\\Program Files\\R\\R-4.1.1patched\\bin\\Rscript.exe\"\"C:\\Users\\wbeasley\\Documents\\cdw\\butcher-hearing-screen-1\\flow.R\"C:\\Users\\wbeasley\\Documents\\cdw\\butcher-hearing-screen-1"},{"path":"automation.html","id":"automation-sql-server-agent","chapter":"10 Automation & Reproducibility","heading":"10.2.3 SQL Server Agent","text":"SQL Server Agent executes jobs specified schedule. also naturally interfaces SSIS packages deployed server, can also execute formats, like plain sql file.important distinction runs service database server, opposed Task Scheduler, runs service client machine. prefer running jobs server job either:requires elevated/administrative privileges (instance, access sensitive data),require lot network constraints passing large amounts data server client, orfeels like server’s responsibility, rebuilding database index, archiving server logs.","code":""},{"path":"automation.html","id":"auxiliary-issues","chapter":"10 Automation & Reproducibility","heading":"10.3 Auxiliary Issues","text":"following subsections execute schedule code, considered.","code":""},{"path":"automation.html","id":"sink-log-files","chapter":"10 Automation & Reproducibility","heading":"10.3.1 Sink Log Files","text":"{Describe sink output file can examined easily.}","code":""},{"path":"automation.html","id":"package-versions","chapter":"10 Automation & Reproducibility","heading":"10.3.2 Package Versions","text":"project runs repeatedly schedule without human intervention, errors can easily go undetected simple systems. , error messages may clear running procedure RStudio. 
reasons, plan strategy maintaining version R packages. three approaches tradeoffs.\nconventional projects, keep packages date, live occasional breaks time. ’s time update packages week, (a) run daily reports morning, (b) update packages (R & RStudio necessary), (c) rerun reports, finally (d) verify results & c . something different, day adapt pipeline code breaking changes packages.\nupdating package, read NEWS file changes not backwards-compatible (commonly called “breaking changes” news file).\nchanges pipeline code difficult complete day, can roll back previous version remotes::install_version().\nside spectrum, can meticulously specify desired version R package. approach reduces chance new version package breaking existing pipeline code. recommend approach uptime important.\nintuitive implementation install explicit code file like utility/install-dependencies.R:\n\nremotes::install_version(\"dplyr\"     , version = \"0.4.3\" )\nremotes::install_version(\"ggplot2\"   , version = \"2.0.0\" )\nremotes::install_version(\"data.table\", version = \"1.10.4\")\nremotes::install_version(\"lubridate\" , version = \"1.6.0\" )\nremotes::install_version(\"openxlsx\"  , version = \"4.0.17\")\n# ... 
package list continues ...\nAnother implementation convert repo package , specify versions DESCRIPTION file.\nImports:\n   dplyr       (== 0.4.3 )\n   ggplot2     (== 2.0.0 )\n   data.table  (== 1.10.4)\n   lubridate   (== 1.6.0 )\n   openxlsx    (== 4.0.17)\ndownside can difficult set identical machine months. Sometimes packages depend package version incompatible package versions. example, one point, current version dplyr 0.4.3. months later, rlang package (wasn’t explicitly specified list 42 packages) required least version 0.8.0 dplyr. 
developer new machine needs decide whether upgrade dplyr (test breaking changes pipeline) install older version rlang.\nsecond important downside approach can lock user’s projects specific outdated package version.\nothers [8] advocate approach team experienced R, machine dedicated important line-of-business workflow.\nuptime important team experienced languages like Java, Python, C#, consider better suited.\ncompromise two previous approaches renv package - R Environmentals. successor packrat. requires learning cognitive overhead. investment becomes appealing (a) running hourly predictions downtime big deal, (b) machine contains multiple projects require different versions package (dplyr 0.4.3 dplyr 0.8.0).","code":"\nremotes::install_version(\"dplyr\"     , version = \"0.4.3\" )\nremotes::install_version(\"ggplot2\"   , version = \"2.0.0\" )\nremotes::install_version(\"data.table\", version = \"1.10.4\")\nremotes::install_version(\"lubridate\" , version = \"1.6.0\" )\nremotes::install_version(\"openxlsx\"  , version = \"4.0.17\")\n# ... 
package list continues ...Imports:\n   dplyr       (== 0.4.3 )\n   ggplot2     (== 2.0.0 )\n   data.table  (== 1.10.4)\n   lubridate   (== 1.6.0 )\n   openxlsx    (== 4.0.17)"},{"path":"scaling-up.html","id":"scaling-up","chapter":"11 Scaling Up","heading":"11 Scaling Up","text":"","code":""},{"path":"scaling-up.html","id":"data-storage","chapter":"11 Scaling Up","heading":"11.1 Data Storage","text":"Local File vs Conventional Database vs RedshiftUsage Cases","code":""},{"path":"scaling-up.html","id":"data-processing","chapter":"11 Scaling Up","heading":"11.2 Data Processing","text":"R vs SQLR vs Spark","code":""},{"path":"collaboration.html","id":"collaboration","chapter":"12 Parallel Collaboration","heading":"12 Parallel Collaboration","text":"","code":""},{"path":"collaboration.html","id":"social-contract","chapter":"12 Parallel Collaboration","heading":"12.1 Social Contract","text":"IssuesOrganized Commits & Coherent DiffsBranch & Merge Strategy","code":""},{"path":"collaboration.html","id":"code-reviews","chapter":"12 Parallel Collaboration","heading":"12.2 Code Reviews","text":"Daily Reviews PRsPeriodic Reviews Files","code":""},{"path":"collaboration.html","id":"remote","chapter":"12 Parallel Collaboration","heading":"12.3 Remote","text":"Headset & sharing screens","code":""},{"path":"collaboration.html","id":"additional-resources-1","chapter":"12 Parallel Collaboration","heading":"12.4 Additional Resources","text":"(Colin Gillespie 2017), particularly “Efficient collaboration” chapter.(Brian Fitzpatrick 2012)","code":""},{"path":"collaboration.html","id":"loose-notes","chapter":"12 Parallel Collaboration","heading":"12.5 Loose Notes","text":"","code":""},{"path":"collaboration.html","id":"github","chapter":"12 Parallel Collaboration","heading":"12.5.1 GitHub","text":"Review diffs committing. Check things like accidental deletions debugging code deleted (least commented ).Review diffs committing. 
Check things like accidental deletions debugging code deleted (least commented ).Keep chatter minimum, especially projects 3+ people notified every issue post.Keep chatter minimum, especially projects 3+ people notified every issue post.encountering problem,\nTake much ownership reasonable. Don’t merely report ’s error.\ncan’t figure , ask question describe well.\nlow-level file & line code threw error.\ntried solve .\n\n’s questionable line/chunk code, trace origin. sake pointing finger someone, sake understanding origin history.\nencountering problem,Take much ownership reasonable. Don’t merely report ’s error.can’t figure , ask question describe well.\nlow-level file & line code threw error.\ntried solve .\nlow-level file & line code threw error.tried solve .’s questionable line/chunk code, trace origin. sake pointing finger someone, sake understanding origin history.","code":""},{"path":"collaboration.html","id":"common-code","chapter":"12 Parallel Collaboration","heading":"12.5.2 Common Code","text":"involves code/files multiple people use, like REDCap arches.Run file committing . Run common downstream files (e.g., make change arch, also run funnel).upstream variable name must change, alert people. Post GitHub issue announce . 
Tell everyone, search repo (ctrl+shift+f RStudio) alert specific people might affected.","code":""},{"path":"document.html","id":"document","chapter":"13 Documentation","heading":"13 Documentation","text":"","code":""},{"path":"document.html","id":"team-wide","chapter":"13 Documentation","heading":"13.1 Team-wide","text":"","code":""},{"path":"document.html","id":"project-specific","chapter":"13 Documentation","heading":"13.2 Project-specific","text":"","code":""},{"path":"document.html","id":"dataset-origin-structure","chapter":"13 Documentation","heading":"13.3 Dataset Origin & Structure","text":"","code":""},{"path":"document.html","id":"document-issues","chapter":"13 Documentation","heading":"13.4 Issues & Tasks","text":"","code":""},{"path":"document.html","id":"documentation-issue-template","chapter":"13 Documentation","heading":"13.4.1 GitHub Issue Template","text":"going open repo/package public, consider creating template GitHub Issues ’s tailored repo’s unique characteristics. Furthermore, invite feedback user base improve template. appeal REDCapR produced Unexpected Behavior issue template:@nutterb @haozhu233, @rparrish, @sybandrew, one else, time, please look new issue template customized REDCapR/redcapAPI. ’d appreciate feedback improve experience someone encountering problem.’d like something () make easier user provide useful information less effort (b) make easier us help accurately fewer back--forths. template happens help user identify solve problem without creating issue …think everyone happier .think issue leverage Troubleshooter 10+ people contributed . help locate problematic area quickly.@haozhu233, seems ’ve liked template kableExtra. REDCapR different sense ’s difficult provide minimal & self-contained example reproduce problem. experience many users issues, ’d love advice.@nutterb, ’d like template helpful redcapAPI . three quick find--replace occurrences ‘REDCapR’ -> ‘redcapAPI’. 
mostly distinguish R package REDCap .","code":""},{"path":"document.html","id":"flow-diagrams","chapter":"13 Documentation","heading":"13.5 Flow Diagrams","text":"","code":""},{"path":"document.html","id":"document-workstation","chapter":"13 Documentation","heading":"13.6 Setting up new machine","text":"Thoroughly describe programs configuration settings team follow. Feel free adapt list needs.’ll see handful benefits:New hires productive sooner, able spend time conceptual issues instead walking tedious installation issues.New hires productive sooner, able spend time conceptual issues instead walking tedious installation issues.everyone team similar environment, easier share code. quality code hopefully improves everyone can leverage others’ contributions.everyone team similar environment, easier share code. quality code hopefully improves everyone can leverage others’ contributions.Sometimes department reluctant grant admin rights, especially new users. likely trust team installation documentation demonstrates thought carefully issues. Typically users just need programs like Office Adobe; may realize many tools used well-rounded data scientist.\nstill reluctant grant admin privileges, make sure realize () takes ~45 minutes install ~12 programs fresh machine, (b) many programs updated every months, (c) data scientist typically installs 5+ R packages month explore tools stay current field. Installing maintaining everyone’s workstation require significant amount time. team willing help alleviate burden maintain software.Sometimes department reluctant grant admin rights, especially new users. likely trust team installation documentation demonstrates thought carefully issues. 
Typically users just need programs like Office Adobe; may realize many tools used well-rounded data scientist.still reluctant grant admin privileges, make sure realize () takes ~45 minutes install ~12 programs fresh machine, (b) many programs updated every months, (c) data scientist typically installs 5+ R packages month explore tools stay current field. Installing maintaining everyone’s workstation require significant amount time. team willing help alleviate burden maintain software.","code":""},{"path":"document.html","id":"document-mechanics","chapter":"13 Documentation","heading":"13.7 Documenting with Markdown in a GitHub Repo","text":"quick demo walks https://national-covid-cohort-collaborative.github.io/book--n3c-v1/Select correct file repo.","code":""},{"path":"style.html","id":"style","chapter":"14 Style Guide","heading":"14 Style Guide","text":"Using consistent style across projects can increase overhead data science team discusses options, decides good choice, develops compliant code. like themes document, cost worth effort. Unforced code errors reduced code consistent, mistake-prone styles apparent.part, team follows tidyverse style. additional conventions attempt follow. Many inspired (Francesco Balena 2005).","code":""},{"path":"style.html","id":"readability","chapter":"14 Style Guide","heading":"14.1 Readability","text":"","code":""},{"path":"style.html","id":"style-number","chapter":"14 Style Guide","heading":"14.1.1 Number","text":"word “number” ambiguous, especially data science. Try specific terms:count: number discrete objects events, visit_count, pt_count, dx_count.id: value uniquely identifies entity doesn’t change time, pt_id, clinic_id, client_id.index: 1-based sequence ’s typically temporary, unique within dataset. instance, pt_index 195 Tuesday’s dataset likely different person pt_index 195 Wednesday. given day, one value 195.tag: persistent across time like “id”, typically created analysts send research team. 
See snippet appendix example.tally: running countduration: length time. Specify units self-evident.physical statistical quantities like\n“depth”,\n“length”,\n“mass”,\n“mean”, \n“sum”.","code":""},{"path":"style.html","id":"style-abbreviation","chapter":"14 Style Guide","heading":"14.1.2 Abbreviations","text":"Try avoid abbreviations. Different people tend shorten words differently; variability increases chance people reference wrong variable. least, wastes time trying remember subject_number, subject_num, subject_no used. Consistency section describes can reduce errors increase efficiency.However, terms long reasonably use without shortening. make exceptions, following scenarios:humans commonly use term orally. instance, people tend say “” instead “operating room”.humans commonly use term orally. instance, people tend say “” instead “operating room”.team agreed set list abbreviations. list CDW team includes:\nappt (“apt”),\ncdw,\ncpt,\ndrg (stands diagnosis-related group),\ndx,\nhx,\nicd\npt, \nvr (vital records).team agreed set list abbreviations. list CDW team includes:\nappt (“apt”),\ncdw,\ncpt,\ndrg (stands diagnosis-related group),\ndx,\nhx,\nicd\npt, \nvr (vital records).team choose terms (e.g., ‘apt’ vs ‘appt’), try use standard vocabulary, MedTerms Medical Dictionary.","code":""},{"path":"style.html","id":"style-datasets","chapter":"14 Style Guide","heading":"14.2 Datasets","text":"","code":""},{"path":"style.html","id":"style-datasets-filter","chapter":"14 Style Guide","heading":"14.2.1 Filtering Rows","text":"Removing datasets rows important operation frequent source sneaky errors. practices reduce mistakes improve maintainability.","code":""},{"path":"style.html","id":"style-datasets-filter-number-line","chapter":"14 Style Guide","heading":"14.2.1.1 Mimic number line","text":"ordering quantities, go smallest--largest type left--right. minimum consistent direction. words, use operators like < <= avoid > >=. 
approach also makes consistent SQL dplyr function, between().","code":"\n# Good (b/c quantities increase as you read left-to-right)\nds_teenager |>\n  dplyr::filter(13 <= age & age < 20)\n\n# Not as good (b/c quantities increase as you read right-to-left)\nds_teenager |>\n  dplyr::filter(20 > age & age >= 13)\n\n# Bad (b/c the order is inconsistent)\nds_teenager |>\n  dplyr::filter(age >= 13 & age < 20)\nds_teenager |>\n  dplyr::filter(age < 20 & age >= 13)"},{"path":"style.html","id":"style-datasets-filter-searchable","chapter":"14 Style Guide","heading":"14.2.1.2 Searchable verbs","text":"’ve occasionally asked frustration, “dataset lose rows? 900 rows middle script, now 782.” scan script location potentially removes rows. locations easier identify ’re scanning small set filtering functions \ntidyr::drop_na(),\ndplyr::filter(), \ndplyr::summarize(). can even highlight ‘ctrl+f’. contrast, base R’s filtering style difficult identify.","code":"\n# tidyverse's approach is easy to see in a long script\nds <-\n  ds |>\n  dplyr::filter(4 <= count)\n  \n# base R's approach is harder to see\nds <- ds[4 <= ds$count, ]"},{"path":"style.html","id":"style-datasets-filter-drop_na","chapter":"14 Style Guide","heading":"14.2.1.3 Remove rows with missing values","text":"Even within tidyverse functions, preferences certain scenarios. entry covers scenario dropping entire row important column missing value.tidyr::drop_na() removes rows missing value specific column. cleaner read write dplyr’s filter() base R’s subsetting bracket. 
particular, ’s easy forget/overlook !.","code":"\n# Cleanest\nds |>\n  tidyr::drop_na(dob)\n\n# Not as good\nds |>\n  dplyr::filter(!is.na(dob))\n\n# Ripest for mistakes or misinterpretation\nds[!is.na(ds$dob), ]"},{"path":"style.html","id":"style-datasets-attach","chapter":"14 Style Guide","heading":"14.2.2 Don’t attach","text":"Google Stylesheet says, “possibilities creating errors using attach() numerous.”Hopefully ’ve learned R recently enough haven’t read examples 1990s used attach(). may made sense early days S-PLUS language used primarily interactively single statistician. contemporary tradeoffs unfavorable, now R scripts frequently run multiple people functions run multiple contexts.","code":""},{"path":"style.html","id":"style-factor","chapter":"14 Style Guide","heading":"14.3 Categorical Variables","text":"lots names categorical variable across different disciplines (e.g., factor, categorical, …).","code":""},{"path":"style.html","id":"style-factor-unknown","chapter":"14 Style Guide","heading":"14.3.1 Explicit Missing Values","text":"Define level like \"unknown\" data manipulation doesn’t test .na(x) x == \"unknown\". explicit label also helps included statistical procedure coefficient table.","code":""},{"path":"style.html","id":"style-factor-granularity","chapter":"14 Style Guide","heading":"14.3.2 Granularity","text":"Sometimes helps represent values differently, say granular variable coarse variable. two related variables 7 3 levels respectively, say *_cut7 *_cut3 denote resolution; related base::cut(). Don’t forget include “unknown” “” necessary.dplyr::recode_factor() ideal replacement scenario , single call combines work dplyr::recode() base::factor(). 
Just make sure recoding order represents desired order factor levels.","code":"# Inside a dplyr::mutate() clause\neducation_cut7      = dplyr::recode(\n  education_cut7,\n  \"No Highschool Degree / GED\"  = \"no diploma\",\n  \"High School Degree / GED\"    = \"diploma\",\n  \"Some College\"                = \"some college\",\n  \"Associate's Degree\"          = \"associate\",\n  \"Bachelor's Degree\"           = \"bachelor\",\n  \"Post-graduate degree\"        = \"post-grad\",\n  \"Unknown\"                     = \"unknown\",\n  .missing                      = \"unknown\",\n),\neducation_cut3      = dplyr::recode(\n  education_cut7,\n  \"no diploma\"    = \"no bachelor\",\n  \"diploma\"       = \"no bachelor\",\n  \"some college\"  = \"no bachelor\",\n  \"associate\"     = \"no bachelor\",\n  \"bachelor\"      = \"bachelor\",\n  \"post-grad\"     = \"bachelor\",\n  \"unknown\"       = \"unknown\",\n),\neducation_cut7 = factor(education_cut7, levels=c(\n  \"no diploma\",\n  \"diploma\",\n  \"some college\",\n  \"associate\",\n  \"bachelor\",\n  \"post-grad\",\n  \"unknown\"\n)),\neducation_cut3 = factor(education_cut3, levels=c(\n  \"no bachelor\",\n  \"bachelor\",\n  \"unknown\"\n)),# Inside a dplyr::mutate() clause\neducation_cut7      = dplyr::recode_factor(\n  education_cut7,\n  \"No Highschool Degree / GED\"  = \"no diploma\",\n  \"High School Degree / GED\"    = \"diploma\",\n  \"Some College\"                = \"some college\",\n  \"Associate's Degree\"          = \"associate\",\n  \"Bachelor's Degree\"           = \"bachelor\",\n  \"Post-graduate degree\"        = \"post-grad\",\n  \"Unknown\"                     = \"unknown\",\n  .missing                      = \"unknown\",\n),\neducation_cut3      = dplyr::recode_factor(\n  education_cut7,\n  \"no diploma\"    = \"no bachelor\",\n  \"diploma\"       = \"no bachelor\",\n  \"some college\"  = \"no bachelor\",\n  \"associate\"     = \"no bachelor\",\n  \"bachelor\"      = \"bachelor\",\n  \"post-grad\"     = 
\"bachelor\",\n  \"unknown\"       = \"unknown\",\n),"},{"path":"style.html","id":"style-dates","chapter":"14 Style Guide","heading":"14.4 Dates","text":"Date arithmetic hard. Naming dates well might harder.birth_month_index can values 1 12, birth_month (commonly mob) contains year (e.g., 2014-07-15).birth_month_index can values 1 12, birth_month (commonly mob) contains year (e.g., 2014-07-15).birth_year integer, birth_month birth_week dates. Typically months collapsed 15th day weeks collapsed Monday, defaults OuhscMunge::clump_month_date() OuhscMunge::clump_week_date(). obfuscate real value PHI involved. Months centered midpoint usually better representation month’s performance month’s initial day.birth_year integer, birth_month birth_week dates. Typically months collapsed 15th day weeks collapsed Monday, defaults OuhscMunge::clump_month_date() OuhscMunge::clump_week_date(). obfuscate real value PHI involved. Months centered midpoint usually better representation month’s performance month’s initial day.Don’t use minus operator (.e., -). See Defensive Date Arithmetic.Don’t use minus operator (.e., -). See Defensive Date Arithmetic.","code":""},{"path":"style.html","id":"style-naming","chapter":"14 Style Guide","heading":"14.5 Naming","text":"","code":""},{"path":"style.html","id":"style-naming-variables","chapter":"14 Style Guide","heading":"14.5.1 Variables","text":"builds upon tidyverse style guide objects.","code":""},{"path":"style.html","id":"style-naming-variables-characters","chapter":"14 Style Guide","heading":"14.5.1.1 Characters","text":"Use lowercase letters, using underscores separate words. Avoid uppercase letters periods.","code":""},{"path":"style.html","id":"style-naming-semantic","chapter":"14 Style Guide","heading":"14.5.2 Semantic Order","text":"variables including multiple nouns adjectives, place global terms microscopic terms. 
“bigger” term goes first; “smaller” terms successively nested bigger terms.Large datasets multiple questionnaires (multiple subsections) much manageable variables follow semantic order.don’t know picked term “semantic order”. may come Semantic Versioning software releases.","code":"\n# Good:\nparent_name_last\nparent_name_first\nparent_dob\nkid_name_last\nkid_name_first\nkid_dob\n\n# Bad:\nlast_name_parent\nfirst_name_parent\ndob_parent\nlast_name_kid\nfirst_name_kid\ndob_kid\n\nSELECT\n  asq3_medical_problems_01\n  ,asq3_medical_problems_02\n  ,asq3_medical_problems_03\n  ,asq3_behavior_concerns_01\n  ,asq3_behavior_concerns_02\n  ,asq3_behavior_concerns_03\n  ,asq3_worry_01\n  ,asq3_worry_02\n  ,asq3_worry_03\n  ,wai_01_steps_beneficial\n  ,wai_02_hv_useful\n  ,wai_03_parent_likes_me\n  ,wai_04_hv_doubts\n  ,hri_01_client_input\n  ,hri_02_problems_discussed\n  ,hri_03_addressing_problems_clarity\n  ,hri_04_goals_discussed\nFROM miechv.gpav_3"},{"path":"style.html","id":"style-naming-files","chapter":"14 Style Guide","heading":"14.5.3 Files and Folders","text":"Naming files folders/directories follows style naming variables, one small difference: separate words dashes (.e., -), underscores (.e., _). words, “kebab case” instead “snake case”.Occasionally, ’ll use dash helps identify noun (already contains underscore). instance, ’s table called patient_demographics, might call files patient_demographics-truncate.sql patient_demographics-insert.sql.Using lower case important databases operating systems case-sensitive, case-insensitive. promote portability, keep everything lowercase., file folder names contain () lowercase letters, (b) digits, (c) dashes, (d) occasional underscore. 
include spaces, uppercase letters, especially punctuation, : (.","code":""},{"path":"style.html","id":"style-naming-datasets","chapter":"14 Style Guide","heading":"14.5.4 Datasets","text":"tibbles (fancy data.frames) used almost every analysis file, put extra effort formulating conventions informative consistent. Naming datasets follows style naming variables, additional features.R world, “dataset” typically synonym data.frame –rectangular structure rows columns. database equivalent conventional table. Note “dataset” means collections tables .NET world, collection (-necessarily-rectangular) files Dataverse.9","code":""},{"path":"style.html","id":"style-naming-datasets-prefix","chapter":"14 Style Guide","heading":"14.5.4.1 Prefix with ds_ and d_","text":"Datasets handled differently variables find ’s easier identify type scope. prefix ds_ indicates dataset available entire file, d_ indicates scope localized function.","code":"\ncount_elements <- function (d) {\n  nrow(d) * ncol(d)\n}\n\nds <- mtcars\ncount_elements(d = ds)"},{"path":"style.html","id":"style-naming-datasets-grain","chapter":"14 Style Guide","heading":"14.5.4.2 Express the grain","text":"grain dataset describes row represents, similar idea statistician’s concept “unit analysis”. Essentially granular entity described. Many miscommunications silly mistakes avoided team disciplined enough define tidy dataset clear grain.insight grains, Ralph Kimball writesIn debugging literally thousands dimensional designs students years, found frequent design error far declaring grain fact table beginning design process. grain isn’t clearly defined, whole design rests quicksand. Discussions candidate dimensions go around circles, rogue facts introduce application errors sneak design.\n…\nhope ’ve noticed powerful effects declaring grain. First, can visualize dimensionality doctor bill line item precisely, can therefore confidently examine data sources, deciding whether dimension can attached data. 
example, probably exclude “treatment outcome” example medical billing data doesn’t tie notion outcome.","code":"\nds_student          # One row per student\nds_teacher          # One row per teacher\nds_course           # One row per course\nds_course_student   # One row per student-course combination\nds_pt         # One row per patient\nds_pt_visit   # One row per patient-visit combination\nds_visit      # Same as above, since it's clear a visit is connected w/ a pt"},{"path":"style.html","id":"style-naming-datasets-singular","chapter":"14 Style Guide","heading":"14.5.4.3 Singular table names","text":"adopt style table’s name reflects grain, corollary. grain singular like “one row per client” “one row per building”, name ds_client ds_building (ds_clients ds_buildings). datasets saved database, tables called client building.Table names plural grain plural. record field like client_id, date_birth, date_graduation date_death, suggest called table client_milestones (single row contains three milestones).Stack Overflow post presents variety opinions justifications adopting singular plural naming scheme.think ’s acceptable R vectors follow different style R data.frames. instance, vector can plural name even though element singular (e.g., client_ids <- c(10, 24, 25)).","code":""},{"path":"style.html","id":"style-naming-datasets-ds-only","chapter":"14 Style Guide","heading":"14.5.4.4 Use ds when definition is clear","text":"Many times ellis file handles one incoming csv outgoing dataset, grain obvious –typically ellis filename clearly states grain.case, R script can use just ds instead ds_county.","code":""},{"path":"style.html","id":"style-naming-datasets-adjective","chapter":"14 Style Guide","heading":"14.5.4.5 Use an adjective after the grain, if necessary","text":"R file manipulating two datasets grain, qualify differences grain, ds_client_all ds_client_michigan. 
Adjectives commonly indicate one dataset subset another.occasional limitation naming scheme difficult distinguish grain adjective. instance, grain ds_student_enroll either () every instance student enrollment (.e., student enroll describe grain) (b) subset students enrolled (.e., student grain enroll adjective)? ’s clear without examine code, comments, documentation.someone solution, love hear . far, ’ve reluctant decorate variable name , ds_grain_client_adj_enroll.","code":""},{"path":"style.html","id":"style-naming-datasets-define","chapter":"14 Style Guide","heading":"14.5.4.6 Define the dataset when in doubt","text":"’s potentially unclear new reader, use comment immediately dataset’s initial use. grain frequently important characteristic document.","code":"\n# `ds_client_enroll`:\n#    grain: one row per client\n#    subset: only clients who have successfully enrolled are included\n#    source: the `client` database table, where `enroll_count` is 1+.\nds_client_enroll <- ..."},{"path":"style.html","id":"style-whitespace","chapter":"14 Style Guide","heading":"14.6 Whitespace","text":"Although execution rarely affected whitespace R SQL files, consistent minimalistic. One benefit Git diffs won’t show unnecessary churn. line code lights diff, ’s nice reflect real change, something trivial like tabs converted spaces, trailing spaces added deleted.guidelines handled automatically modern IDEs, configure correct settings.Tabs replaced spaces. modern IDEs option automatically. (RStudio calls “Insert spaces tabs”.)Indentions replaced consistent number spaces, depending file type.\nR: 2 spaces\nSQL: 2 spaces\nPython: 4 spaces\nR: 2 spacesSQL: 2 spacesPython: 4 spacesEach file end blank line. 
(RStudio checkbox “Ensure source files end newline.”)Remove spaces tabs end lines.\nVS Code: see VS Code section Workstation chapter.\nAzure Data Studio: See ADS section Workstation chapter.\nRStudio: Global Options | Code | Saving | Strip trailing horizontal whitespace saving.\nSSMS:\nVS Code: see VS Code section Workstation chapter.Azure Data Studio: See ADS section Workstation chapter.RStudio: Global Options | Code | Saving | Strip trailing horizontal whitespace saving.SSMS:","code":""},{"path":"style.html","id":"style-database","chapter":"14 Style Guide","heading":"14.7 Database","text":"GitLab’s data team good style guide databases sql ’s fairly consistent style. important additions differences areFavor CTEs subqueries ’re easier follow can reused file. performance problem, slightly rewrite CTE temp table see new indexes help.\nResources:\nBrent Ozar’s SQL Server Common Table Expressions defines basics:\n\nCTE effectively creates temporary view developer can reference multiple times underlying query.\n\nBrent Ozar’s ’s Better, CTEs Temp Tables? article’s bottom line :\n\n’d suggest starting CTEs ’re easy write read. hit performance wall, try ripping CTE writing temp table, joining temp table.\n\nFavor CTEs subqueries ’re easier follow can reused file. performance problem, slightly rewrite CTE temp table see new indexes help.Resources:Brent Ozar’s SQL Server Common Table Expressions defines basics:\n\nCTE effectively creates temporary view developer can reference multiple times underlying query.\nBrent Ozar’s SQL Server Common Table Expressions defines basics:CTE effectively creates temporary view developer can reference multiple times underlying query.Brent Ozar’s ’s Better, CTEs Temp Tables? article’s bottom line :\n\n’d suggest starting CTEs ’re easy write read. hit performance wall, try ripping CTE writing temp table, joining temp table.\nBrent Ozar’s ’s Better, CTEs Temp Tables? article’s bottom line :’d suggest starting CTEs ’re easy write read. 
hit performance wall, try ripping CTE writing temp table, joining temp table.name primary key typically contain table. employee table, key employee_id, id.name primary key typically contain table. employee table, key employee_id, id.","code":""},{"path":"style.html","id":"style-repo","chapter":"14 Style Guide","heading":"14.8 Code Repositories","text":"analytical team dedicates private repo research project. repository GitHub accessible team members granted explicit privileges. Repos also discussed Git & GitHub appendix.","code":""},{"path":"style.html","id":"style-repo-naming","chapter":"14 Style Guide","heading":"14.8.1 Repo Naming","text":"2022, GitHub organization 300 repos. Many focused warehouse projects completed within month. easiest stable naming system ’ve found built three parts:PI’s last name. Even contact project manager, prefer use primary investigator’s name (typically name IRB application) rarely changes easier trace right team. refer medical resident fellow rotate months.Two three word term. Describe global area words.Index. optimistic prepare follow investigations. initial repo “…-1”, subsequent repos “…-2, …-3, …-4”.informally call “project tag” try use consistently different arenas, :GitHub repo’s name.parent directory project file server (e.g., M:/pediatrics/bbmc/akande-covid-1).database schema containing project’s tables (e.g., akande_covid_1.patient, akande_covid_1.visit, akande_covid_2.visit). 
Change kebab case snake case (e.g., akande-covid-1 akande_covid_1) sql code doesn’t escape schema name brackets.body emails help retrospective searches.","code":"\n# Good Examples\nakande-asthma-hospitalization-1\nakande-asthma-hospitalization-2\nakande-covid-1\nbard-covid-1\nbard-covid-2\nbard-eeg-education-1\n\n# Bad Examples\nakande-1\nakande-2\ncovid-1\ncovid-2\ncovid-3\nbard-research-1"},{"path":"style.html","id":"style-repo-granularity","chapter":"14 Style Guide","heading":"14.8.2 Repo Granularity","text":"boundaries research project may fuzzy, may clear answer question, “considered one large research project one repo, two smaller research projects two total repos?”. deciding factor us usually determined amount living code need exist repos. two projects developed parallel make similar changes repos, strongly consider using one repo.issues suggest unified repo:two repos almost identical users.two repos covered IRB.Issues suggest separate repos:development windows don’t overlap. initial project wrapped last year follow-study starting, consider separate repo starts subset code. Start fresh copy ’s necessary","code":""},{"path":"style.html","id":"style-repo-pricing","chapter":"14 Style Guide","heading":"14.8.3 Repo Pricing","text":"enrolled GitHub program 2012 allows academic research group unlimited private repos GitHub Organization. Otherwise, feasible 300+ tightly-focused repos.GitHub seems introduce new programs modify existing branding every years. current best documentation “Apply educator researcher discount”. Notice program lightweight program like “GitHub Campus”, involves whole campus apparently.","code":""},{"path":"style.html","id":"style-ggplot","chapter":"14 Style Guide","heading":"14.9 ggplot2","text":"expressiveness ggplot2 allows someone quickly develop precise scientific graphics. One graph can specified many equivalent styles, increases opportunity confusion. 
formalized much style writing textbook introductory statistics (Lise DeShea (2015)); 200+ graphs code publicly available.additional ggplot2 tips tidyverse style guide.","code":""},{"path":"style.html","id":"style-ggplot-order","chapter":"14 Style Guide","heading":"14.9.1 Order of commands","text":"ggplot2 essentially collection functions combined + operator. Publication graphs common require least 20 functions, means functions can sometimes redundant step toes. family functions follow consistent order ideally starting important structural functions ending cosmetic functions. preference :ggplot() primary function specify default dataset aesthetic mappings. Many arguments can passed aes(), prefer follow order consistent scale_*() order .geom_*() annotate() creates geometric elements represent data. Unlike categories list, order matters. Geoms specified first drawn first, therefore can obscured subsequent geoms.scale_*() describes dimension data (specified aes()) translated visual element. 
specify dimensions descending order (typical) importance: x, y, group, color, fill, size, radius, alpha, shape, linetype.coord_*()facet_*() label_*()guides()theme() (call ‘big’ themes like theme_minimal() overriding details like theme(panel.grid = element_line(color = \"gray\")))labs()graph contains typical ggplot2 elements.","code":"ggplot(ds, aes(x = group, y = lift_count, fill = group, color = group)) +\n  geom_bar(stat = \"summary\", fun.y = \"mean\", color = NA) +\n  geom_point(position = position_jitter(w = 0.4, h = 0), shape = 21) +\n  scale_color_manual(values = palette_pregnancy_dark) +\n  scale_fill_manual( values = palette_pregnancy_light) +\n  coord_flip() +\n  facet_wrap(\"time\") +\n  theme_minimal() +\n  theme(legend.position = \"none\") +\n  theme(panel.grid.major.y = element_blank()) +\n  labs(\n    title = \"Lifting by Group across Time\",\n    x     = NULL,\n    y     = \"Number of Lifts\"\n  )"},{"path":"style.html","id":"style-ggplot-gotchas","chapter":"14 Style Guide","heading":"14.9.2 Gotchas","text":"common mistakes see --infrequently (even sometimes ggplot2 code).","code":""},{"path":"style.html","id":"style-ggplot-zoom","chapter":"14 Style Guide","heading":"14.9.2.1 Zooming","text":"Call coord_*() restrict plotted x/y values, scale_*() lims()/xlim()/ylim(). coord_*() zooms axes, extreme values essentially fall page; contrast, latter three functions essentially remove values dataset. distinction matter simple bivariate scatterplot, likely mislead viewer two common scenarios. First, call geom_smooth() (e.g., overlays loess regression curve) ignore extreme values entirely; consequently summary location misplaced standard errors tight. 
Second, line graph spaghetti plots contains extreme value, sometimes desirable zoom primary area activity; calling coord_*(), trend line leave return plotting panel (implies points exist fit page), yet calling others, trend line appear interrupted, extreme point missing value.","code":""},{"path":"style.html","id":"style-ggplot-seed","chapter":"14 Style Guide","heading":"14.9.2.2 Seed","text":"jittering, set seed ‘declare-globals’ chunk rerunning report won’t create (slightly) different png. insignificantly different pngs consume extra space Git repository. Also, GitHub diff show difference png versions, requires extra subjectivity cognitive load determine difference due solely jittering, something really changed analysis.Occasionally ’ll want multiple graphs report consistent jitter, set seed prior ggplot() call. Lise DeShea’s 2015 book, Figures 3-21, 3-22, 3-23 needed similar possible inter-graph differences easier distinguish.","code":"\n# ---- declare-globals ---------------------------------------------------------\nset.seed(seed = 789) # Set a seed so the jittered graphs are consistent across renders.\n# ---- figure-03-21 ------------------------------------------------------\nset.seed(seed = 789)\nggplot(ds, aes(x = group, y = t1_lifts, fill = group)) +\n...\n\n# ---- figure-03-22 ------------------------------------------------------\nset.seed(seed = 789)\nggplot(ds, aes(x = group, y = t1_lifts, fill = group)) +\n...\n\n# ---- figure-03-23 ------------------------------------------------------\nset.seed(seed = 789)\nggplot(ds, aes(x = group, y = t1_lifts, fill = group)) +\n..."},{"path":"publication.html","id":"publication","chapter":"15 Publishing Results","heading":"15 Publishing Results","text":"","code":""},{"path":"publication.html","id":"publication-analysts","chapter":"15 Publishing Results","heading":"15.1 To Other Analysts","text":"","code":""},{"path":"publication.html","id":"publication-experts","chapter":"15 Publishing Results","heading":"15.2 
To Researchers & Content Experts","text":"","code":""},{"path":"publication.html","id":"publication-phobic","chapter":"15 Publishing Results","heading":"15.3 To Technical-Phobic Audiences","text":"","code":""},{"path":"validation.html","id":"validation","chapter":"16 Validation","heading":"16 Validation","text":"","code":""},{"path":"validation.html","id":"validation-intro","chapter":"16 Validation","heading":"16.1 Intro","text":"learn tools efficiently generate informative descriptive reports, time invest almost always pays .Validating dataset serves many beneficial roles, includingexploring basic descriptive patterns,verifying understand variable’s definition,communicating team already understand,describing variation locations time periods,evaluating preliminary hypotheses, andassessing likelihood assumptions inferential models reasonable.","code":""},{"path":"validation.html","id":"validation-ad-hoc","chapter":"16 Validation","heading":"16.2 Ad-hoc Manual Inspections","text":"recommend starting basic question developing quick dirty report addresses immediate need. initial curiosity satisfied, consider report can evolve address future needs. One common evolutionary path report inform inferential model. second common path assimilated automated report frequently run.","code":""},{"path":"validation.html","id":"validation-inferential","chapter":"16 Validation","heading":"16.3 Inferential Support","text":"","code":""},{"path":"validation.html","id":"validation-inferential-background","chapter":"16 Validation","heading":"16.3.1 Brief Intro to Inferential Statistics","text":"Descriptive statistics differ inferential statistics. descriptive statistic concerns observed elements sample, average height range weakest strongest systolic blood pressure. fuzziness forecasting descriptive statistic –’s simply straight-forward equation observed points.10An inferential statistic tries reach beyond descriptive statistic: projects beyond observed sample. 
assesses pattern within collected sample likely exist larger population. Suppose group 40 newborns tended faster heart rates 33 infants. Stated differently, average 40 newborns faster average 33 infants. large Student t (accompanying small p-value) may indicate difference exists among babies –just among 73. (Notice ’re comparing average two groups, saying slowest newborn still faster fastest infant) However order conclusions valid, several assumptions must met. See (Lise DeShea 2015) information t-test analyses commonly used health care. sense, t-test resembles broad category inferential statistics: validity assumptions can evaluated research design (e.g., kid measured independently), assumptions best evaluated data (e.g., residuals/errors follow approximate bell-shaped distribution). graphs useful assessing appropriateness inferential statistic: for beginners: histograms; for beginners: scatterplot observed; for beginners: plots residuals (i.e., discrepancy point’s observed & predicted value); advanced users, see suite graphs built base R. In words, can help establish foundation justifies inferential statistic. important … comfortable inferential statistic reasonably meet assumptions conclusions valid.","code":""},{"path":"validation.html","id":"automated-reports","chapter":"16 Validation","heading":"16.4 Automated Reports","text":"two strategies (ad-hoc inspections inferential support) can connected. ad-hoc inspection enlightening, consider spending ~15 minutes making report easily reproducible things change. 
reasons report monitored repeatedly changes in: Temporal Trends (e.g., dataset Jan 2020 Dec 2020 looks different Jan 2020 Dec 2022); Inclusion criteria (e.g., restrict list diagnosis code); Data Partner sites (e.g., new site contributes data patterns didn’t anticipate)","code":""},{"path":"testing.html","id":"testing","chapter":"17 Testing","heading":"17 Testing","text":"","code":""},{"path":"testing.html","id":"testing-functions","chapter":"17 Testing","heading":"17.1 Testing Functions","text":"","code":""},{"path":"testing.html","id":"validator","chapter":"17 Testing","heading":"17.2 Validator","text":"Benefits Analysts; Benefits Data Collectors","code":""},{"path":"troubleshooting.html","id":"troubleshooting","chapter":"18 Troubleshooting and Debugging","heading":"18 Troubleshooting and Debugging","text":"","code":""},{"path":"troubleshooting.html","id":"finding-help","chapter":"18 Troubleshooting and Debugging","heading":"18.1 Finding Help","text":"Within group (e.g., Thomas REDCap questions); Within university (e.g., SCUG); Outside (e.g., Stack Overflow; GitHub issues)","code":""},{"path":"troubleshooting.html","id":"debugging","chapter":"18 Troubleshooting and Debugging","heading":"18.2 Debugging","text":"traceback(), browser(), etc.","code":""},{"path":"workstation.html","id":"workstation","chapter":"19 Workstation","heading":"19 Workstation","text":"believe important keep software updated consistent across workstations project. material originally posted https://github.com/OuhscBbmc/RedcapExamplesAndPatterns/blob/main/DocumentationGlobal/ResourcesInstallation.md. help establish tools new development computer.","code":""},{"path":"workstation.html","id":"workstation-required","chapter":"19 Workstation","heading":"19.1 Required Installation","text":"installation order matters.","code":""},{"path":"workstation.html","id":"workstation-r","chapter":"19 Workstation","heading":"19.1.1 R","text":"R centerpiece analysis. Every months, ’ll need download recent version. 
{added Sept 2012}","code":""},{"path":"workstation.html","id":"workstation-rstudio","chapter":"19 Workstation","heading":"19.1.2 RStudio","text":"RStudio Desktop IDE (integrated development environment) ’ll use interact R, GitHub, Markdown. Updates can checked easily menus Help -> Check Updates. {added Sept 2012} Note: non-default changes facilitate workflow. Choose “Global Options” “Tools” menu bar. General | Basic | Restore .RData workspace startup: unchecked; General | Basic | Save workspace .RData exit: never; General | Basic | Always save history: unchecked; Code | Editing | Use native pipe operator, |>: checked; Code | Saving | Ensure source files end newline: checked; Code | Saving | Strip trailing horizontal whitespace saving: checked; Sweave | Weave Rnw file using: knitr","code":""},{"path":"workstation.html","id":"workstation-rtools","chapter":"19 Workstation","heading":"19.1.3 R Tools","text":"R Tools Windows necessary build packages development hosted GitHub. running Windows, follow page’s instructions, especially “Putting Rtools PATH” section. running Linux, components R Tools likely already installed machine. {added Feb 2017}","code":""},{"path":"workstation.html","id":"workstation-r-package-installation","chapter":"19 Workstation","heading":"19.1.4 Installing R Packages","text":"Dozens R Packages need installed. Choose one two related scripts. install list packages data analysts typically need. script installs package ’s already installed; also existing package updated newer version available. Create new ‘personal library’ prompts . takes least fifteen minutes, start go lunch. list packages evolve time, please help keep list updated. install frequently-used packages, run following snippet. first line installs important package. second line calls online Gist, defines package_janitor_remote() function. function installs packages listed two CSVs, package-dependency-list.csv package-dependency-list-more.csv. projects require specialized packages typically used. 
cases, develop git repo R package includes proper DESCRIPTION file. See RAnalysisSkeleton example.project opened RStudio, update_packages_addin() OuhscMunge find DESCRIPTION file install package dependencies.","code":"\nif (!base::requireNamespace(\"devtools\")) utils::install.packages(\"devtools\")\ndevtools::source_gist(\"2c5e7459b88ec28b9e8fa0c695b15ee3\", filename=\"package-janitor-bbmc.R\")\n\n# Important packages required by most BBMC projects\npackage_janitor_remote(\n  \"https://raw.githubusercontent.com/OuhscBbmc/RedcapExamplesAndPatterns/main/utility/package-dependency-list.csv\"\n)\n\n# Nonessential packages used in a few BBMC projects\npackage_janitor_remote(\n  \"https://raw.githubusercontent.com/OuhscBbmc/RedcapExamplesAndPatterns/main/utility/package-dependency-list-more.csv\"\n)\nif( !base::requireNamespace(\"remotes\"   ) ) utils::install.packages(\"remotes\")\nif( !base::requireNamespace(\"OuhscMunge\") ) remotes::install_github(\"OuhscBbmc/OuhscMunge\")\nOuhscMunge::update_packages_addin()"},{"path":"workstation.html","id":"workstation-r-package-update","chapter":"19 Workstation","heading":"19.1.5 Updating R Packages","text":"Several R packages need updated every weeks. Unless told (break something -rare), periodically update packages executing following code update.packages(checkBuilt = TRUE, ask = FALSE).","code":""},{"path":"workstation.html","id":"workstation-github","chapter":"19 Workstation","heading":"19.1.6 GitHub","text":"GitHub registration necessary push modified files repository. First, register free user account, tell repository owner exact username, add collaborator (e.g., https://github.com/OuhscBbmc/RedcapExamplesAndPatterns). {added Sept 2012}","code":""},{"path":"workstation.html","id":"workstation-github-client","chapter":"19 Workstation","heading":"19.1.7 GitHub Desktop","text":"GitHub Desktop basic tasks little easier git features built RStudio. client available Windows macOS. 
(Occasionally, someone might need use git command line fix problems, required start.) {added Sept 2012}","code":""},{"path":"workstation.html","id":"workstation-recommended","chapter":"19 Workstation","heading":"19.2 Recommended Installation","text":"installation order matter.","code":""},{"path":"workstation.html","id":"workstation-odbc","chapter":"19 Workstation","heading":"19.2.1 ODBC Driver","text":"ODBC Driver SQL Server connecting token server, institution using one. writing, version 18 recent driver version. See new one exists. {updated Feb 2022}","code":""},{"path":"workstation.html","id":"workstation-quarto","chapter":"19 Workstation","heading":"19.2.2 Quarto","text":"Quarto Posit’s/RStudio’s successor knitr. uses embedded version Pandoc translate R/Python/Julia code html pdf reports (via Markdown). Reporting reproducible research foundation workflow Quarto used upcoming generation reports. existing Rmd file delivering need (something like article federal report), continue using knitr R Markdown. developing new report scratch, strongly consider Quarto. {added Nov 2022}Quarto’s Get Started page instructions. ’ll want installed RStudio IDE, probably VS Code . See troubleshooting tips necessary.","code":""},{"path":"workstation.html","id":"workstation-notepadpp","chapter":"19 Workstation","heading":"19.2.3 Notepad++","text":"Notepad++ text editor allows look raw text files, code CSVs. CSVs data files, helpful troubleshooting (instead looking file Excel, masks & causes issues). 
{added Sept 2012}","code":""},{"path":"workstation.html","id":"workstation-ads","chapter":"19 Workstation","heading":"19.2.4 Azure Data Studio","text":"Azure Data Studio (ADS) now recommended Microsoft others analysts (roles) –ahead SQL Server Management Studio. Note: non-default changes facilitate workflow. Settings | Text Editor | Tab Size: 2 {\"editor.tabSize\": 2}; Settings | Text Editor | Detect Indentation: uncheck {\"editor.detectIndentation\": false}; Settings | Text Editor | Insert Final Newlines: check {\"files.insertFinalNewline\": true}; Settings | Text Editor | Trim Final Newlines: check {\"files.trimFinalNewlines\": true}; Settings | Text Editor | Trim Trailing Whitespace: check {\"files.trimTrailingWhitespace\": true}; Data | Sql | Show Connection Info Title: uncheck {\"sql.showConnectionInfoInTitle\": false}; Data | Sql | Include Headers: check {\"sql.copyIncludeHeaders\": false}","code":"{\n  \"workbench.enablePreviewFeatures\": true,\n  \"workbench.colorTheme\": \"Default Dark Azure Data Studio\",\n  \"editor.tabSize\": 2,\n  \"editor.detectIndentation\": false,\n  \"files.insertFinalNewline\": true,\n  \"files.trimFinalNewlines\": true,\n  \"files.trimTrailingWhitespace\": true,\n  \"queryEditor.showConnectionInfoInTitle\": false,\n  \"queryEditor.results.copyIncludeHeaders\": false\n}"},{"path":"workstation.html","id":"workstation-vscode","chapter":"19 Workstation","heading":"19.2.5 Visual Studio Code","text":"Visual Studio Code extensible text editor runs Windows Linux. ’s much lighter full Visual Studio. Like Atom, supports browsing directory structure, replacing across files, interaction git, previewing markdown. VS Code good documentation Basic Editing. Productivity VS Code enhanced following extensions: {added Dec 2018} Excel Viewer isn’t good name, ’ve liked capability. displays CSVs files grid. 
{added Dec 2018} Rainbow CSV color codes columns, still allows see edit raw plain-text file. {added Dec 2018} SQL Server allows execute database, view/copy/save grid results. doesn’t replicate SSMS features, nice scanning files. {added Dec 2018} Code Spell Checker produces green squiggly lines words dictionary. can add words user dictionary, project dictionary. Markdown One useful markdown capabilities, converting file html. Markdown PDF useful markdown capabilities, converting file pdf. markdownlint linting style checking. extensions can installed command line. Note: non-default changes facilitate workflow. 
Either copy configuration settings.json, manually specify options settings editor.Settings | Extensions |Markdown One | Ordered List | Auto Renumber: false {\"markdown.extension.orderedList.autoRenumber\": false}Settings | Extensions |Markdown One | Ordered List | Marker: one {\"markdown.extension.orderedList.marker\": \"one\"}","code":"code --list-extensions\ncode --install-extension GrapeCity.gc-excelviewer\ncode --install-extension mechatroner.rainbow-csv\ncode --install-extension ms-mssql.mssql\ncode --install-extension streetsidesoftware.code-spell-checker\ncode --install-extension yzhang.markdown-all-in-one\ncode --install-extension yzane.markdown-pdf\ncode --install-extension DavidAnson.vscode-markdownlint{\n  \"diffEditor.ignoreTrimWhitespace\": false,\n  \"diffEditor.maxComputationTime\": 0,\n  \"editor.acceptSuggestionOnEnter\": \"off\",\n  \"editor.renderWhitespace\": \"all\",\n  \"explorer.confirmDragAndDrop\": false,\n  \"files.associations\": {\n      \"*.Rmd\": \"markdown\"\n  },\n  \"files.trimFinalNewlines\": true,\n  \"files.trimTrailingWhitespace\": true,\n  \"git.autofetch\": true,\n  \"git.confirmSync\": false,\n  \"window.zoomLevel\": 2,\n\n  \"markdown.extension.orderedList.autoRenumber\": false,\n  \"markdown.extension.orderedList.marker\": \"one\",\n  \"markdownlint.config\": {\n      \"MD003\": { \"style\": \"setext_with_atx\" },\n      \"MD007\": { \"indent\": 2 },\n      \"MD022\": { \"lines_above\": 1,\n                  \"lines_below\": 1 },\n      \"MD024\": { \"siblings_only\": true },\n      \"no-bare-urls\": false,\n      \"no-inline-html\": {\n        \"allowed_elements\": [\n          \"mermaid\",\n          \"a\",\n          \"br\",\n          \"details\",\n          \"img\"\n        ]\n      }\n  }\n}"},{"path":"workstation.html","id":"workstation-optional","chapter":"19 Workstation","heading":"19.3 Optional Installation","text":"installation order 
matter.","code":""},{"path":"workstation.html","id":"workstation-git","chapter":"19 Workstation","heading":"19.3.1 Git","text":"Git command-line utility enables advanced operations GitHub client doesn’t support. Use default installation options, except preferences :\n1. Nano default text editor.","code":""},{"path":"workstation.html","id":"workstation-calc","chapter":"19 Workstation","heading":"19.3.2 LibreOffice Calc","text":"LibreOffice Calc alternative Excel. Unlike Excel, doesn’t guess much formatting (usually mess things, especially dates).","code":""},{"path":"workstation.html","id":"workstation-pandoc","chapter":"19 Workstation","heading":"19.3.3 pandoc","text":"pandoc converts files one markup format another. {added Sept 2012}","code":""},{"path":"workstation.html","id":"workstation-python","chapter":"19 Workstation","heading":"19.3.4 Python","text":"Python used analysts. prototypical installation involves two options.Anaconda, include Jupyter Notebooks, Jupyter Lab, Spyder. Plus two programs already list: RStudio VS Code. Windows, open “Anaconda Prompt” administrative privileges\nconda install numpy pandas scikit-learn matplotlibAnaconda, include Jupyter Notebooks, Jupyter Lab, Spyder. Plus two programs already list: RStudio VS Code. Windows, open “Anaconda Prompt” administrative privilegesStandard Python, installing packages pip3 terminal. 
pip3 command unrecognized ’s missing OS path variable, alternative py -3 -mpip install paramiko; calls pip py command sometimes path variable installation.\nusing Windows .msi installer, recommended options \nCheck “Add Python 3.10 PATH”\nCheck “Install launcher users (recommended)”\nClick “Customize Installation”\nOptional Features\nCheck “Documentation”\nCheck “pip”\n“users (requires elevation)”\n\nAdvanced Options\nCheck “Install users” (set install path something like C:\\Program Files\\Python310.)\nCheck “Add Python environment variables”\nCheck “Precompile standard library”\n\nmsi completes:\nAdd entry like C:\\Users\\USERNAME\\AppData\\Roaming\\Python\\Python310 C:\\Users\\USERNAME\\AppData\\Local\\Programs\\Python\\Python310 System Variables scripts personal AppData directory (even clicked “Install users”). helps RStudio/reticulate run python scripts.\nInstall Python packages PowerShell command line (Python)\npy -3 -mpip install biopython matplotlib numpy pandas paramiko pyarrow pyodbc pyyaml scikit-learn scipy sqlalchemy strictyaml\n\nStandard Python, installing packages pip3 terminal. 
pip3 command unrecognized ’s missing OS path variable, alternative py -3 -mpip install paramiko; calls pip py command sometimes path variable installation.using Windows .msi installer, recommended options areCheck “Add Python 3.10 PATH”Check “Add Python 3.10 PATH”Check “Install launcher users (recommended)”Check “Install launcher users (recommended)”Click “Customize Installation”Click “Customize Installation”Optional Features\nCheck “Documentation”\nCheck “pip”\n“users (requires elevation)”\nOptional FeaturesCheck “Documentation”Check “pip”“users (requires elevation)”Advanced Options\nCheck “Install users” (set install path something like C:\\Program Files\\Python310.)\nCheck “Add Python environment variables”\nCheck “Precompile standard library”\nAdvanced OptionsCheck “Install users” (set install path something like C:\\Program Files\\Python310.)Check “Add Python environment variables”Check “Precompile standard library”msi completes:\nAdd entry like C:\\Users\\USERNAME\\AppData\\Roaming\\Python\\Python310 C:\\Users\\USERNAME\\AppData\\Local\\Programs\\Python\\Python310 System Variables scripts personal AppData directory (even clicked “Install users”). helps RStudio/reticulate run python scripts.\nInstall Python packages PowerShell command line (Python)\npy -3 -mpip install biopython matplotlib numpy pandas paramiko pyarrow pyodbc pyyaml scikit-learn scipy sqlalchemy strictyaml\nmsi completes:Add entry like C:\\Users\\USERNAME\\AppData\\Roaming\\Python\\Python310 C:\\Users\\USERNAME\\AppData\\Local\\Programs\\Python\\Python310 System Variables scripts personal AppData directory (even clicked “Install users”). helps RStudio/reticulate run python scripts.Add entry like C:\\Users\\USERNAME\\AppData\\Roaming\\Python\\Python310 C:\\Users\\USERNAME\\AppData\\Local\\Programs\\Python\\Python310 System Variables scripts personal AppData directory (even clicked “Install users”). 
helps RStudio/reticulate run python scripts. Install Python packages PowerShell command line (Python)\npy -3 -mpip install biopython matplotlib numpy pandas paramiko pyarrow pyodbc pyyaml scikit-learn scipy sqlalchemy strictyaml\nUpdating Packages Python packages don’t need updated frequently R packages, ’s still good every months.\nPaste single line PowerShell Windows. (Stack Overflow solution Sébastien Wieckowski)\npip list -o --format json | ConvertFrom-Json | foreach {pip install $_.name -U --no-warn-script-location}\nPaste single line Bash terminal Linux. (ActiveState.com post.)\npip3 list --outdated --format=freeze | grep -v '^\\-e' | cut -d = -f 1 | xargs -n1 pip3 install -U","code":"conda install numpy pandas scikit-learn matplotlib\npy -3 -mpip install biopython matplotlib numpy pandas paramiko pyarrow pyodbc pyyaml scikit-learn scipy sqlalchemy strictyaml\npip list -o --format json | ConvertFrom-Json | foreach {pip install $_.name -U --no-warn-script-location}\npip3 list --outdated --format=freeze | grep -v '^\\-e' | cut -d = -f 1 | xargs -n1 pip3 install -U "},{"path":"workstation.html","id":"workstation-pilot-edit","chapter":"19 Workstation","heading":"19.3.5 PilotEdit","text":"PilotEdit can load huge text files fit RAM, files 100MB choke Excel, Calc, Notepad++, Visual Studio Code. Like Notepad++ VS Code, PilotEdit good Find features can (a) present search hits within file, (b) scan multiple files, (c) use regular expressions. helps trace origin problems pipeline. 
example, data warehouse suspicious character patient 10009’s BMI value, regex \\b10009\\tbmi\\b locates origin among multiple 1+GB files received. PilotEdit also good tool occasional data extract encoding problem. can side-by-side inspect hex code (visible non-visible) character produced (example ascii, “76” produces “v” “0A” produces line feed). {Added Sept 2020}","code":""},{"path":"workstation.html","id":"workstation-assets","chapter":"19 Workstation","heading":"19.4 Asset Locations","text":"GitHub repository https://github.com/OuhscBbmc/RedcapExamplesAndPatterns {added Sept 2012} File server directory Ask PI. Peds, ’s typically “S” drive. SQL Server Database Ask Thomas, David. REDCap database Ask Thomas, David. http url, ’re trying publicize value. ODBC UserDsn name depends specific repository, SQL Server database. Ask Thomas, David set .","code":""},{"path":"workstation.html","id":"workstation-administrator","chapter":"19 Workstation","heading":"19.5 Administrator Installation","text":"programs useful people administrating servers, typical data scientist.","code":""},{"path":"workstation.html","id":"workstation-mysql","chapter":"19 Workstation","heading":"19.5.1 MySQL Workbench","text":"MySQL Workbench useful occasionally REDCap admins.","code":""},{"path":"workstation.html","id":"workstation-postman","chapter":"19 Workstation","heading":"19.5.2 Postman","text":"Postman Native App useful developing API replaced Chrome app. ’s possible, web client available well. 
either program, access PHI.","code":""},{"path":"workstation.html","id":"workstation-ssms","chapter":"19 Workstation","heading":"19.5.3 SQL Server Management Studio (SSMS)","text":"SQL Server Management Studio replaced Azure Data Studio roles, still recommended database administrators. easy way access database write queries (transfer SQL R file). ’s required REDCap API, ’s usually necessary integrating REDCap databases.Note: non-default changes facilitate workflow. first two help save database structure (data) GitHub, can easily track/monitor structural changes time. tabs options keeps things consistent editors. SSMS ‘Tools | Options’ dialog box:SQL Server Object Explorer | Scripting | Include descriptive headers: FalseSQL Server Object Explorer | Scripting | Script extended properties: FalseText Editor | Languages | Tabs | Tab size: 2Text Editor | Languages | Tabs | Indent size: 2Text Editor | Languages | Tabs | Insert Spaces: trueThese don’t affect saved files, make life easier. first makes result font bigger.Environment | Fonts Colors | Show settings : Grid Results | Size: 10Query Results | SQL Server | Results Grid | Include column headers copying saving results: false`Designers | Table Database Designers | Prevent saving changes require table-recreation: falseText Editor | Editor Tab Status Bar | Tab Text | Include Server Name: falseText Editor | Editor Tab Status Bar | Tab Text | Include Database Name: falseText Editor | Editor Tab Status Bar | Tab Text | Include Login Name: falseText Editor | Languages | General | Line Numbers: trueA dark theme unofficially supported SSMS 18. write privileges “Program Files” directory, quick modification config file reduce eye strain. 
change also prevents screen flashing dark--light--dark, broadcasts wandering attention Zoom meeting.details, see setting--dev-machine.md (private repo ’s restricted BBMC members).","code":""},{"path":"workstation.html","id":"workstation-winscp","chapter":"19 Workstation","heading":"19.5.4 WinSCP","text":"WinSCP GUI SCP SFTP file transfer using SSH keys. tool occasionally useful admins collaborating institutions OU computing resources. PHI can accidentally sent collaborators without DUA, recommend WinSCP installed informed administrators. typical data scientist teams need tool.alternative FileZilla. works multiple OSes, currently doesn’t support scp (sftp).","code":""},{"path":"workstation.html","id":"workstation-troubleshooting","chapter":"19 Workstation","heading":"19.6 Installation Troubleshooting","text":"Git: Beasley resorted workaround Sept 2012: http://stackoverflow.com/questions/3431361/git--windows--program-cant-start--libiconv2-dll--missing. copied following four files D:/Program Files/msysgit/mingw/bin/ D:/Program Files/msysgit/bin/: (1) libiconv2.dll, (2) libcurl-4.dll, (3) libcrypto.dll, (4) libssl.dll. (install default location, ’ll move instead C:/msysgit/mingw/bin/ C:/msysgit/bin/) {added Sept 2012}Git: Beasley resorted workaround Sept 2012: http://stackoverflow.com/questions/3431361/git--windows--program-cant-start--libiconv2-dll--missing. copied following four files D:/Program Files/msysgit/mingw/bin/ D:/Program Files/msysgit/bin/: (1) libiconv2.dll, (2) libcurl-4.dll, (3) libcrypto.dll, (4) libssl.dll. (install default location, ’ll move instead C:/msysgit/mingw/bin/ C:/msysgit/bin/) {added Sept 2012}Git: different computer, Beasley couldn’t get RStudio recognize msysGit, installed Full installer official Git Windows 1.7.11 (http://code.google.com/p/msysgit/downloads/list) switched Git Path RStudio Options. 
{added Sept 2012}Git: different computer, Beasley couldn’t get RStudio recognize msysGit, installed Full installer official Git Windows 1.7.11 (http://code.google.com/p/msysgit/downloads/list) switched Git Path RStudio Options. {added Sept 2012}RStudio\nsomething goes wrong RStudio, re-installing might fix issue, personal preferences aren’t erased. safe, can thorough delete equivalent C:\\Users\\wibeasley\\AppData\\Local\\RStudio\\. options settings stored (can manipulated) extensionless text file: C:\\Users\\wibeasley\\AppData\\Local\\RStudio\\monitored\\user-settings\\user-settings. See RStudio’s support page, Resetting RStudio Desktop’s State. {added Sept 2012}\nHold ctrl button clicking RStudio Windows Start Menu. Try switching 64/32-bit option. VDI, forcing software-rendering option fixed problem RStudio window opened, nothing visible inside. {added Jan 2022}\nmight help look logs, stored equivalent C:\\Users\\wibeasley\\AppData\\Local\\RStudio\\logs {added Jan 2022}\nRStudioIf something goes wrong RStudio, re-installing might fix issue, personal preferences aren’t erased. safe, can thorough delete equivalent C:\\Users\\wibeasley\\AppData\\Local\\RStudio\\. options settings stored (can manipulated) extensionless text file: C:\\Users\\wibeasley\\AppData\\Local\\RStudio\\monitored\\user-settings\\user-settings. See RStudio’s support page, Resetting RStudio Desktop’s State. {added Sept 2012}Hold ctrl button clicking RStudio Windows Start Menu. Try switching 64/32-bit option. VDI, forcing software-rendering option fixed problem RStudio window opened, nothing visible inside. 
{added Jan 2022}might help look logs, stored equivalent C:\\Users\\wibeasley\\AppData\\Local\\RStudio\\logs {added Jan 2022}Quarto\n(rendering document) encounter error like compilation failed- no matching packages ...LaTeX Error: File 'scrreprt.cls' not found., ’ll need replace installation tinytex.\nFirst uninstall & remove via R.\n\ntinytex::uninstall_tinytex()\nremove.packages(\"tinytex\")\nreinstall via command line PowerShell.\nquarto tools install tinytex","code":"\ntinytex::uninstall_tinytex()\nremove.packages(\"tinytex\")\nquarto tools install tinytex"},{"path":"workstation.html","id":"workstation-windows","chapter":"19 Workstation","heading":"19.7 Windows Installation","text":"","code":""},{"path":"workstation.html","id":"workstation-windows-explorer","chapter":"19 Workstation","heading":"19.7.1 File Explorer","text":"reviewing repo files, ’s frequently important see file extensions hidden files File Explorer. View Menu: check box “File name extensions”; View Menu: check box “Hidden items”","code":""},{"path":"workstation.html","id":"workstation-ubuntu","chapter":"19 Workstation","heading":"19.8 Ubuntu Installation","text":"","code":""},{"path":"workstation.html","id":"workstation-ubuntu-r","chapter":"19 Workstation","heading":"19.8.1 R","text":"Check https://cran.r-project.org/bin/linux/ubuntu/ recent instructions.","code":"  ### Add the key, update the list, then install base R.\n  sudo apt update -qq\n  
sudo apt install --no-install-recommends software-properties-common dirmngr\n  wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc\n  sudo add-apt-repository \"deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/\"\n  sudo apt-get install r-base r-base-dev"},{"path":"workstation.html","id":"workstation-ubuntu-rstudio","chapter":"19 Workstation","heading":"19.8.2 RStudio","text":"Download recent version https://www.rstudio.com/products/rstudio/download/#download. run two gdebi() lines.\nAlternatively, update wget line recent version.","code":"  # wget https://download1.rstudio.org/desktop/bionic/amd64/rstudio-1.4.1717-amd64.deb\n  sudo apt-get install gdebi-core\n  sudo gdebi rstudio-*-amd64.deb"},{"path":"workstation.html","id":"workstation-ubuntu-packages","chapter":"19 Workstation","heading":"19.8.3 apt-get Packages","text":"next block can copied pasted (ctrl-shift-v) console entirely. 
lines can pasted individual (without ( function install-packages { line, last three lines).","code":"( function install-packages {\n\n  ### Git\n  sudo apt-get install git-core\n  git config --global user.email \"wibeasley@hotmail.com\"\n  git config --global user.name \"Will Beasley\"\n  git config --global credential.helper 'cache --timeout=3600000'\n\n  ### Ubuntu & Bioconductor packages that are indirectly needed for packages and BBMC scripts\n\n  # Supports the `locate` command in bash\n  sudo apt-get install mlocate\n\n  # The genefilter package is needed for 'modeest' on CRAN.\n  # No longer a modeest dependency: Rscript -e 'BiocManager::install(\"genefilter\")'\n\n  ### CRAN packages that are also on the Ubuntu repositories\n\n  # The 'xml2' package; https://CRAN.R-project.org/package=xml2\n  sudo apt-get --yes install libxml2-dev r-cran-xml\n\n  # The 'curl' package, and others; https://CRAN.R-project.org/package=curl\n  sudo apt-get --yes install libssl-dev libcurl4-openssl-dev\n\n  # The 'udunits2' package: https://cran.r-project.org/web/packages/udunits2/index.html\n  sudo apt-get --yes install libudunits2-dev\n\n  # The 'odbc' package: https://github.com/r-dbi/odbc#linux---debian--ubuntu\n  sudo apt-get --yes install unixodbc-dev tdsodbc odbc-postgresql libsqliteodbc\n\n  # The 'rgl' package; https://stackoverflow.com/a/39952771/1082435\n  sudo apt-get --yes install libcgal-dev libglu1-mesa-dev\n\n  # The 'gsl' package; https://cran.rstudio.com/web/packages/gsl/INSTALL\n  sudo apt-get --yes install libgsl0-dev\n\n  # The 'magick' package; https://docs.ropensci.org/magick/articles/intro.html#build-from-source\n  sudo apt-get --yes install 'libmagick++-dev'\n\n  # To compress vignettes when building a package; https://kalimu.github.io/post/checklist-for-r-package-submission-to-cran/\n  sudo apt-get --yes install qpdf\n\n  # The 'pdftools' and 'Rpoppler' packages, which involve PDFs\n  sudo apt-get --yes install libpoppler-cpp-dev libpoppler-glib-dev\n\n  
# The 'sys' package\n  sudo apt-get --yes install libapparmor-dev\n\n  # The 'archive' package; https://CRAN.R-project.org/package=archive\n  sudo apt-get --yes install libarchive-dev\n\n  # The 'sf' and other spatial packages: https://github.com/r-spatial/sf#ubuntu; https://github.com/r-spatial/sf/pull/1208\n  sudo apt-get --yes install libudunits2-dev libgdal-dev libgeos-dev libproj-dev libgeos++-dev\n\n  # For Cairo package, a dependency of Shiny & plotly; https://gykovacsblog.wordpress.com/2017/05/15/installing-cairo-for-r-on-ubuntu-17-04/\n  sudo apt-get --yes install libcairo2-dev\n\n  # 'rJava' and others; https://www.r-bloggers.com/installing-rjava-on-ubuntu/\n  sudo apt-get --yes install default-jre default-jdk\n  sudo R CMD javareconf\n  sudo apt-get --yes install r-cran-rjava\n\n  # For reprex and sometimes ssh keys; https://github.com/tidyverse/reprex#installation\n  sudo apt-get --yes install xclip\n\n  # gifski -apparently the rust compiler is necessary\n  sudo apt-get --yes install cargo\n\n  # For databases\n  sudo apt-get --yes install sqlite sqliteman\n  sudo apt-get --yes install postgresql postgresql-contrib pgadmin3\n\n  # pandoc\n  sudo apt-get --yes install pandoc\n\n  # For checking packages. Avoid `/usr/bin/texi2dvi: not found` warning.\n  sudo apt-get install texinfo\n}\ninstall-packages\n)"},{"path":"workstation.html","id":"workstation-ubuntu-pandoc","chapter":"19 Workstation","heading":"19.8.4 Pandoc","text":"version pandoc Ubuntu repository may delayed. install latest version, download .deb file install directory. 
Finally, verify version.","code":"sudo dpkg -i pandoc-*\npandoc -v"},{"path":"workstation.html","id":"workstation-ubuntu-postman","chapter":"19 Workstation","heading":"19.8.5 Postman","text":"Postman native app Ubuntu installed snap, updated daily automatically.","code":"snap install postman"},{"path":"workstation.html","id":"workstation-retired","chapter":"19 Workstation","heading":"19.9 Retired Tools","text":"previously installed software . replaced software ’s either newer natural use.GitLab SSL Certificate isn’t software, still needs configured.\nTalk server URL *.cer file.\nSave file something like ~/keys/ca-bundle-gitlab.cer\nAssociate file git config --global http.sslCAInfo ...path.../ca-bundle-gitlab.cer (replace ...path...).\nGitLab SSL Certificate isn’t software, still needs configured.Talk server URL *.cer file.Save file something like ~/keys/ca-bundle-gitlab.cerAssociate file git config --global http.sslCAInfo ...path.../ca-bundle-gitlab.cer (replace ...path...).MiKTeX necessary ’re using knitr Sweave produce LaTeX files (just markdown files). ’s huge, slow installation can take hour two. {added Sept 2012}MiKTeX necessary ’re using knitr Sweave produce LaTeX files (just markdown files). ’s huge, slow installation can take hour two. {added Sept 2012}Pulse Secure VPN client OUHSC researchers. ’s required REDCap API, ’s usually necessary communicate campus data sources.Pulse Secure VPN client OUHSC researchers. ’s required REDCap API, ’s usually necessary communicate campus data sources.msysGit allows RStudio track changes commit & sync GitHub server. Connect RStudio GitHub repository. moved optional (Oct 14, 2012) GitHub client (see ) almost everything RStudio plugin ; little better little robust; installation hasn’t given problems. {added Oct 2012}\nStarting top right RStudio, click: Project -> New Project -> Create Project Version Control -> Git {added Sept 2012}\nexample repository URL https://github.com/OuhscBbmc/RedcapExamplesAndPatterns. 
Specify location save (copy ) project local computer. {added Sept 2012}\nmsysGit allows RStudio track changes commit & sync GitHub server. Connect RStudio GitHub repository. moved optional (Oct 14, 2012) GitHub client (see ) almost everything RStudio plugin ; little better little robust; installation hasn’t given problems. {added Oct 2012}Starting top right RStudio, click: Project -> New Project -> Create Project Version Control -> Git {added Sept 2012}example repository URL https://github.com/OuhscBbmc/RedcapExamplesAndPatterns. Specify location save (copy ) project local computer. {added Sept 2012}CSVed lightweight program viewing data files. fits somewhere text editor Excel.CSVed lightweight program viewing data files. fits somewhere text editor Excel.SourceTree rich client many features GitHub client. don’t recommend beginners, since ways mess things. developers, nicely fills spot GitHub client command-line operations. branching visualization really nice . Unfortunately ironically, doesn’t currently support Linux. {added Sept 2014}.SourceTree rich client many features GitHub client. don’t recommend beginners, since ways mess things. developers, nicely fills spot GitHub client command-line operations. branching visualization really nice . Unfortunately ironically, doesn’t currently support Linux. {added Sept 2014}.git-cola probably best GUI Git supported Linux. ’s available official Ubuntu repositories apt-get (also see ). branch visualization features different, related program, ‘git dag’. {added Sept 2014}git-cola probably best GUI Git supported Linux. ’s available official Ubuntu repositories apt-get (also see ). branch visualization features different, related program, ‘git dag’. {added Sept 2014}GitHub Eclipse something discourage beginner, strongly recommend start RStudio (GitHub Client git capabilities within RStudio) months even consider Eclipse. ’s included list sake completeness. 
installing EGit plug-, ignore eclipse site check youtube video:http://www.youtube.com/watch?v=I7fbCE5nWPU.GitHub Eclipse something discourage beginner, strongly recommend start RStudio (GitHub Client git capabilities within RStudio) months even consider Eclipse. ’s included list sake completeness. installing EGit plug-, ignore eclipse site check youtube video:http://www.youtube.com/watch?v=I7fbCE5nWPU.Color Oracle simulates three common types color blindness. produce color graph report develop, check Color Oracle (ask someone else ). ’s already installed, takes less 10 second check three types color blindness. ’s installed, extra work may necessary Java isn’t already installed. download zip, extract ColorOracle.exe program like. {added Sept 2012}Color Oracle simulates three common types color blindness. produce color graph report develop, check Color Oracle (ask someone else ). ’s already installed, takes less 10 second check three types color blindness. ’s installed, extra work may necessary Java isn’t already installed. download zip, extract ColorOracle.exe program like. {added Sept 2012}Atom text editor, similar Notepad++. Notepad++ appears efficient opening large CSVs. Atom better suited editing lot files repository. finding replacing across lot files, superior Notepad++ RStudio; permits regexes great GUI preview potential replacements.\nProductivity enhanced following Atom packages:\nSublime Style Column Selection: Enable Sublime style ‘Column Selection’. Just hold ‘alt’ select, select using middle mouse button.\natom-language-r allows Atom recognize files R. prevents spell checking indicators enable syntax highlighting. need browse lot scattered R files quickly, Atom’s tree panel (left) works well. older alternative language-r.\nlanguage-csv: Adds syntax highlighting CSV files. 
highlighting nice, automatically disables spell checking lines.\natom-beautify: Beautify HTML, CSS, JavaScript, PHP, Python, Ruby, Java, C, C++, C#, Objective-C, CoffeeScript, TypeScript, Coldfusion, SQL, Atom.\natom-wrap--tag: wraps tag around selection; just select word phrase hit Alt + Shift + w.\nminimap: preview full source code (right margin).\nscript: Run scripts based file name, selection code, line number.\ngit-plus: git things without terminal (don’t think necessary anymore).\npackages can installed Atom, apm utility command line:\napm install sublime-style-column-selection atom-language-r language-csv atom-beautify atom-wrap--tag minimap script\nfollowing settings keep files consistent among developers.\nFile | Settings | Editor | Tab Length: 2 (opposed 3 4, used conventions)\nFile | Settings | Editor | Tab Type: soft (inserts 2 spaces instead tab ‘Tab’ pressed)\nAtom text editor, similar Notepad++. Notepad++ appears efficient opening large CSVs. Atom better suited editing lot files repository. finding replacing across lot files, superior Notepad++ RStudio; permits regexes great GUI preview potential replacements.Productivity enhanced following Atom packages:Sublime Style Column Selection: Enable Sublime style ‘Column Selection’. Just hold ‘alt’ select, select using middle mouse button.atom-language-r allows Atom recognize files R. prevents spell checking indicators enable syntax highlighting. need browse lot scattered R files quickly, Atom’s tree panel (left) works well. older alternative language-r.language-csv: Adds syntax highlighting CSV files. 
highlighting nice, automatically disables spell checking lines.atom-beautify: Beautify HTML, CSS, JavaScript, PHP, Python, Ruby, Java, C, C++, C#, Objective-C, CoffeeScript, TypeScript, Coldfusion, SQL, Atom.atom-wrap--tag: wraps tag around selection; just select word phrase hit Alt + Shift + w.minimap: preview full source code (right margin).script: Run scripts based file name, selection code, line number.git-plus: git things without terminal (don’t think necessary anymore).packages can installed Atom, apm utility command line:following settings keep files consistent among developers.File | Settings | Editor | Tab Length: 2 (opposed 3 4, used conventions)File | Settings | Editor | Tab Type: soft (inserts 2 spaces instead tab ‘Tab’ pressed)","code":"apm install sublime-style-column-selection atom-language-r language-csv atom-beautify atom-wrap-in-tag minimap script"},{"path":"tools.html","id":"tools","chapter":"20 Considerations when Selecting Tools","heading":"20 Considerations when Selecting Tools","text":"","code":""},{"path":"tools.html","id":"general","chapter":"20 Considerations when Selecting Tools","heading":"20.1 General","text":"","code":""},{"path":"tools.html","id":"the-components-goal","chapter":"20 Considerations when Selecting Tools","heading":"20.1.1 The Component’s Goal","text":"discussing advantages disadvantages tools, colleague said, “Tidyverse packages don’t anything can’t already Base R, sometimes even requires lines code”. Regardless agree, feel two points irrelevant. Sometimes advantage tool isn’t expand existing capabilities, rather facilitate development maintenance capability.Likewise, care less line count, readability. ’d prefer maintain 20-line chunk familiar readable 10-line chunk dense phrases unfamiliar functions. 
bottleneck projects human time, execution time.","code":""},{"path":"tools.html","id":"current-skill-set-of-team","chapter":"20 Considerations when Selecting Tools","heading":"20.1.2 Current Skill Set of Team","text":"","code":""},{"path":"tools.html","id":"desired-future-skill-set-of-team","chapter":"20 Considerations when Selecting Tools","heading":"20.1.3 Desired Future Skill Set of Team","text":"","code":""},{"path":"tools.html","id":"skill-set-of-audience","chapter":"20 Considerations when Selecting Tools","heading":"20.1.4 Skill Set of Audience","text":"","code":""},{"path":"tools.html","id":"languages","chapter":"20 Considerations when Selecting Tools","heading":"20.2 Languages","text":"","code":""},{"path":"tools.html","id":"r-packages","chapter":"20 Considerations when Selecting Tools","heading":"20.3 R Packages","text":"developing codebase used many people, choose packages functionality, well ease installation maintainability. example, rJava package powerful package allows R package developers leverage widespread Java framework many popular Java packages. However, installing Java setting appropriate path registry settings can error-prone, especially non-developers.\nTherefore considering two functions comparable capabilities (e.g., xlsx::read.xlsx() readxl::read_excel()), avoid package requires proper installation configuration Java rJava.\nintensive choice required (say, need capability xlsx missing readxl), take:\n20 minutes start markdown file enumerates package’s direct indirect dependencies require manual configuration (e.g., rJava Java), download , typical installation steps.\n5 minutes create GitHub Issue () announces new requirement, (b) describes /needs install requirement, (c) points markdown documentation, (d) encourages teammates post problems, recommendations, solutions issue. ’ve found dedicated Issue helps communicate package dependency necessitates intention encourages people assist people’s troubleshooting. 
something potentially useful posted Issue, move markdown document. Make sure document issue hyperlink .\n15 minutes every year re-evaluate landscape. Confirm package still actively maintained, newer (easily- maintained) package offers desired capability.12 better fit now exists, evaluate effort transition new package worth benefit. willing transition project relatively green, development upcoming. willing transition transition relatively -place, require much modification code training people.\ndeveloping codebase used many people, choose packages functionality, well ease installation maintainability. example, rJava package powerful package allows R package developers leverage widespread Java framework many popular Java packages. However, installing Java setting appropriate path registry settings can error-prone, especially non-developers.Therefore considering two functions comparable capabilities (e.g., xlsx::read.xlsx() readxl::read_excel()), avoid package requires proper installation configuration Java rJava.intensive choice required (say, need capability xlsx missing readxl), take:20 minutes start markdown file enumerates package’s direct indirect dependencies require manual configuration (e.g., rJava Java), download , typical installation steps.20 minutes start markdown file enumerates package’s direct indirect dependencies require manual configuration (e.g., rJava Java), download , typical installation steps.5 minutes create GitHub Issue () announces new requirement, (b) describes /needs install requirement, (c) points markdown documentation, (d) encourages teammates post problems, recommendations, solutions issue. ’ve found dedicated Issue helps communicate package dependency necessitates intention encourages people assist people’s troubleshooting. something potentially useful posted Issue, move markdown document. 
Make sure document issue hyperlink .5 minutes create GitHub Issue () announces new requirement, (b) describes /needs install requirement, (c) points markdown documentation, (d) encourages teammates post problems, recommendations, solutions issue. ’ve found dedicated Issue helps communicate package dependency necessitates intention encourages people assist people’s troubleshooting. something potentially useful posted Issue, move markdown document. Make sure document issue hyperlink .15 minutes every year re-evaluate landscape. Confirm package still actively maintained, newer (easily- maintained) package offers desired capability.12 better fit now exists, evaluate effort transition new package worth benefit. willing transition project relatively green, development upcoming. willing transition transition relatively -place, require much modification code training people.15 minutes every year re-evaluate landscape. Confirm package still actively maintained, newer (easily- maintained) package offers desired capability.12 better fit now exists, evaluate effort transition new package worth benefit. willing transition project relatively green, development upcoming. willing transition transition relatively -place, require much modification code training people.Finally, consider much traffic passes dependency brittle dependency disruptive isolated downstream analysis file run one statistician. 
hand, protective middle pipeline typically team runs.","code":""},{"path":"tools.html","id":"database","chapter":"20 Considerations when Selecting Tools","heading":"20.4 Database","text":"Ease installation & maintenanceEase installation & maintenanceSupport –database engine comfortable supporting.Support –database engine comfortable supporting.Integration LDAP, Active Directory, Shibboleth.Integration LDAP, Active Directory, Shibboleth.Warehouse vs transactional performanceWarehouse vs transactional performance","code":""},{"path":"tools.html","id":"additional-resources-2","chapter":"20 Considerations when Selecting Tools","heading":"20.5 Additional Resources","text":"(Colin Gillespie 2017), particularly “Package selection” section.","code":""},{"path":"team.html","id":"team","chapter":"21 Growing a Team","heading":"21 Growing a Team","text":"","code":""},{"path":"team.html","id":"recruiting","chapter":"21 Growing a Team","heading":"21.1 Recruiting","text":"","code":""},{"path":"team.html","id":"training-to-data-science","chapter":"21 Growing a Team","heading":"21.2 Training to Data Science","text":"Starting ResearcherStarting StatisticianStarting DBAStarting Software Developer","code":""},{"path":"team.html","id":"bridges-outside-the-team","chapter":"21 Growing a Team","heading":"21.3 Bridges Outside the Team","text":"Monthly User GroupsAnnual Conferences","code":""},{"path":"redcap-user.html","id":"redcap-user","chapter":"22 Material for REDCap Users","heading":"22 Material for REDCap Users","text":"","code":""},{"path":"redcap-user.html","id":"redcap-user-login","chapter":"22 Material for REDCap Users","heading":"22.1 Login","text":"","code":""},{"path":"redcap-user.html","id":"redcap-user-report-develop","chapter":"22 Material for REDCap Users","heading":"22.2 Developing Reports","text":"Please first read Login","code":""},{"path":"redcap-developer.html","id":"redcap-developer","chapter":"23 Material for REDCap Developers","heading":"23 Material for REDCap 
Developers","text":"","code":""},{"path":"redcap-admin.html","id":"redcap-admin","chapter":"24 Material for REDCap Admins","heading":"24 Material for REDCap Admins","text":"","code":""},{"path":"git.html","id":"git","chapter":"A Git & GitHub","heading":"A Git & GitHub","text":"","code":""},{"path":"git.html","id":"git-justification","chapter":"A Git & GitHub","heading":"A.1 Justification","text":"(Written 2017 justify service corporation’s department.)Git GitHub de facto version control software hosting solution software development modern data science. Using GitHub help group three critical tasks: () developing software, (b) leveraging innovations others, (c) attracting top talent.Developing Software: Version control critical developing quality software, especially multiple data scientists contributing code bank. Among modern version control software, Git GitHub popular new projects, especially among talent pool recruit . Compared outdated approaches using conventional file-servers, version control substantially increases productivity. Analysts can develop code & report parallel, combine branch mature. Additionally, commits saved indefinitely, allowing us ‘turn back clock’ resurrect older code necessary. also allows us organize manage proprietary code single (distributed) location.Given needs small data science team, believe private GitHub repositories (secured two-factor authentication) strike nice balance () security, (b) ease use developers, (c) ease maintenance administrators, (d) cost.Leveraging Innovation: cutting-edge data science algorithms released GitHub. algorithms stand-alone software; instead augment statistical software, R, approved . Furthermore, GitHub.com hosts documentation user forums data science algorithms. Without access information, greater risk misunderstanding misusing routines, weaken accuracy financial reports produce.Attracting Talent: compete top talent highly competitive field data science, want provide access standard tools. 
want send message organization doesn’t value advancements appreciated employed competitors.Alternatives: GitHub approach described common, approached endorsed contemporary developers. Others include:GitHub Enterprise: hosting solution developed GitHub, hosted university-controlled VM.GitLab: competitor GitHub. GitLab uses Git, different hosting options, cloud -premises.Mercurial: modern version control similar Git. many Git’s strengths avoids many undesirable features Subversion/SVN.Atlassian: competitor GitHub focuses businesses. Atlassian/Bitbucket repositories can use Git Mercurial. Like GitHub GitLab, offers different hosting options.Resources:GitHub BusinessGit Teams","code":""},{"path":"git.html","id":"git-code","chapter":"A Git & GitHub","heading":"A.2 for Code Development","text":"Jenny Bryan Jim Hester published thorough description using Git data scientist’s perspective (Happy Git GitHub useR), recommend following guidance. consistent approach, exceptions noted . complementary resource Team Geek, insightful advice human collaborative aspects version control.ResourcesSetting CI/CD Process GitHub Travis CI. Travis-CI blog August 2019.","code":""},{"path":"git.html","id":"git-collaboration","chapter":"A Git & GitHub","heading":"A.3 for Collaboration","text":"Somewhat separate ’s version control capabilities, GitHub provides built-tools coordinating projects across people time. tools revolves around GitHub Issues, allow teammates toSomewhat separate ’s version control capabilities, GitHub provides built-tools coordinating projects across people time. 
tools revolves around GitHub Issues, allow teammates totrack issues assigned otherstrack issues assigned otherssearch teammates encountered similar problems facing now (e.g., new computer can’t install rJava package).search teammates encountered similar problems facing now (e.g., new computer can’t install rJava package).’s nothing magical GitHub issues, don’t use , consider using similar capable tools like offered Atlassian, Asana, Basecamp, many others.tips experiences projects involving 2 10 statisticians working upcoming deadline.create error describes problem blocking progress, include raw text (e.g., error: JAVA_HOME determined Registry) possibly screenshot. text allows problem easily searched people later; screenshot usually provides extra context allows understand situation help quickly.create error describes problem blocking progress, include raw text (e.g., error: JAVA_HOME determined Registry) possibly screenshot. text allows problem easily searched people later; screenshot usually provides extra context allows understand situation help quickly.Include enough broad context enough specific details teammates can quickly understand problem. Ideally can even run code debug . Good recommendations can found Stack Overflow posts, ‘make great R reproducible example’ ‘ask good question?’. issues don’t need thorough, teammates start context Stack Overflow reader.\ntypically include\ndescription problem fishy behavior.\nexact error message (good description fishy behavior).\nsnippet 1-10 lines code suspected causing problem.\nlink code’s file (ideally line number, https://github.com/OuhscBbmc/REDCapR/blob/main/R/redcap-version.R#L40) reader can hop entire file.\nreferences similar GitHub Issues Stack Overflow questions aid troubleshooting.\nInclude enough broad context enough specific details teammates can quickly understand problem. Ideally can even run code debug . 
Good recommendations can found Stack Overflow posts, ‘make great R reproducible example’ ‘ask good question?’. issues don’t need thorough, teammates start context Stack Overflow reader.typically includea description problem fishy behavior.description problem fishy behavior.exact error message (good description fishy behavior).exact error message (good description fishy behavior).snippet 1-10 lines code suspected causing problem.snippet 1-10 lines code suspected causing problem.link code’s file (ideally line number, https://github.com/OuhscBbmc/REDCapR/blob/main/R/redcap-version.R#L40) reader can hop entire file.link code’s file (ideally line number, https://github.com/OuhscBbmc/REDCapR/blob/main/R/redcap-version.R#L40) reader can hop entire file.references similar GitHub Issues Stack Overflow questions aid troubleshooting.references similar GitHub Issues Stack Overflow questions aid troubleshooting.","code":""},{"path":"git.html","id":"git-stability","chapter":"A Git & GitHub","heading":"A.4 for Stability","text":"Review Git commits closely\nunintended functional difference (e.g., !match accidentally changed match).\nPHI snuck (e.g., patient ID used isolating debugging).\nmetadata format didn’t change (e.g., Excel sometimes changes string ‘010’ number ‘10’). See appendix longer discussion problems Excel typically introduces.\nReview Git commits closelyNo unintended functional difference (e.g., !match accidentally changed match).PHI snuck (e.g., patient ID used isolating debugging).metadata format didn’t change (e.g., Excel sometimes changes string ‘010’ number ‘10’). See appendix longer discussion problems Excel typically introduces.","code":""},{"path":"git.html","id":"organization-wide-defaults-and-practices","chapter":"A Git & GitHub","heading":"A.5 Organization-wide defaults and practices","text":"core-wide goal secure default applies GitHub . 
security measures added explicitly (e.g., .gitignore blocking common data files like *.csv & *.xlsx), organization-wide settings make new repo secure soon initialized, even cost accessibility.DefaultsTwo-factor authentication required organization members outside collaborators. See setting “Security” => “Two-factor authentication”Two-factor authentication required organization members outside collaborators. See setting “Security” => “Two-factor authentication”Organization members restricted creating repositories. See setting “Member privileges” => “Repository creation”.Organization members restricted creating repositories. See setting “Member privileges” => “Repository creation”.Organization members zero permissions new repositories. See setting “Member privileges” => “Default repository permission”\n.Organization members zero permissions new repositories. See setting “Member privileges” => “Default repository permission”\n.PracticesAuthorized teammates outside OUHSC designated outside collaborators, instead “members”.Authorized teammates outside OUHSC designated outside collaborators, instead “members”.three people owners GitHub organization. Everyone else must explicitly added appropriate repository. important restrictions members include () add/delete/transfer (private public) repositories (b) add/delete members organization.three people owners GitHub organization. Everyone else must explicitly added appropriate repository. 
important restrictions members include () add/delete/transfer (private public) repositories (b) add/delete members organization.Every week, owner (probably (wibeasley?)) review organization’s audit log (owners can view).Every week, owner (probably (wibeasley?)) review organization’s audit log (owners can view).Two owners must discuss agree upon adding/modifying/deleting extra entity added GitHub Organization, including\nwebhooks,\nthird-party applications,\ninstalled integration, \nOAuth applications.\nCurrently, approved entity Codecov integration, helps us test package code quantify coverage (“Improve code quality. Expose bugs security vulnerabilities.”). Codecov must explicitly turned desired repository.Two owners must discuss agree upon adding/modifying/deleting extra entity added GitHub Organization, includingwebhooks,third-party applications,installed integration, andOAuth applications.Currently, approved entity Codecov integration, helps us test package code quantify coverage (“Improve code quality. Expose bugs security vulnerabilities.”). Codecov must explicitly turned desired repository.","code":""},{"path":"git.html","id":"git-collaborators","chapter":"A Git & GitHub","heading":"A.6 for New Collaborators","text":"","code":""},{"path":"git.html","id":"git-contribution","chapter":"A Git & GitHub","heading":"A.7 Steps for Contributing to Repo","text":"","code":""},{"path":"git.html","id":"git-contribution-regular","chapter":"A Git & GitHub","heading":"A.7.1 Regular Contributions","text":"","code":""},{"path":"git.html","id":"git-contribution-regular-pull","chapter":"A Git & GitHub","heading":"A.7.1.1 Keep your dev branch fresh","text":"recommend least every day write code repo. 
Perhaps frequently lot developers pushing code (e.g., right reporting deadline).Update “main” branch local machine (GitHub server)Merge main local dev branchPush local dev branch GitHub server","code":""},{"path":"git.html","id":"git-contribution-regular-push","chapter":"A Git & GitHub","heading":"A.7.1.2 Make your code contributions available to other analysts","text":"least every days, push changes main branch teammates can benefit work. Especially improving pipeline code (e.g., Ellises REDCap Arches)Make sure dev branch updated immediately create Pull Request. Follow steps .Verify merged code still works expected. words, make sure new code blended newest main code, nothing breaks. Depending repo, steps might include\nBuild Check repo (assuming repo also package).\nRun code verifies basic functionality repo. (example, MIECHV team run “high-school-funnel.R” verify assertions passed).\nBuild Check repo (assuming repo also package).Run code verifies basic functionality repo. (example, MIECHV team run “high-school-funnel.R” verify assertions passed).Commit changes dev branch push GitHub server.Create Pull Request (otherwise known PR) assign reviewer. (example, developers MIECHV team paired together review ’s code.)reviewer pull dev branch local machine run checks verification (2nd step ). duplicate effort helps verify code likely works everyone machines.reviewer accepts PR main branch now contains changes available teammates.","code":""},{"path":"git.html","id":"main-vs-master-branch","chapter":"A Git & GitHub","heading":"A.7.1.3 “Main” vs “Master” Branch","text":"using old repo (initialized 2021) whose default branch still called “master”, ’s fairly simple rename “main” server.client, two options. first delete reclone (make sure everything pushed central repo deleting). 
second open command prompt (Windows cmd, Windows PowerShell, Linux bash) paste four lines.","code":"git branch -m master main\ngit fetch origin\ngit branch -u origin/main main\ngit remote set-head origin -a"},{"path":"git.html","id":"repo-style","chapter":"A Git & GitHub","heading":"A.8 Repo Style","text":"Please see Code Repositories section Style Guide chapter.{Transfer & update material https://github.com/OuhscBbmc/BbmcResources/blob/main/instructions/github.md}","code":""},{"path":"regex.html","id":"regex","chapter":"B Regular Expressions","heading":"B Regular Expressions","text":"“regular expression” (commonly called “regex”) allows programmer leverage pattern identifies (possibly extracts) nuggets information buried within text fields otherwise unparsable. can’t comfortable regexes data sciencing. learn new regex capabilities, ’ll see opportunities extract information efficiency integrity.Regexes may confusing first (may always remain little confusing) following resources help become proficient.Tools:http://regex101.com easy tool developing testing regex patterns replacements. Cool features include () panel thorough explanation every characteristic regex (b) ability save regex publicly share collaborators. supports different flavors –latest PCRE version corresponds R’s regex engine.\ntransferring regex website R, don’t forget “backslash backslashes”. words, regex pattern \\d{3} (matches three consecutive digits), declare R variable pattern <- \"\\\\d{3}\".http://regex101.com easy tool developing testing regex patterns replacements. Cool features include () panel thorough explanation every characteristic regex (b) ability save regex publicly share collaborators. supports different flavors –latest PCRE version corresponds R’s regex engine.transferring regex website R, don’t forget “backslash backslashes”. 
words, regex pattern \\d{3} (matches three consecutive digits), declare R variable pattern <- \"\\\\d{3}\".Books:Regular Expressions Chapter R Data Science, 2nd edition.Introducing Regular ExpressionsRegular Expressions Cookbook, 2nd EditionMastering Regular Expressions, 3rd editionPresentations:Regex SCUG Presentation","code":""},{"path":"snippets.html","id":"snippets","chapter":"C Snippets","heading":"C Snippets","text":"","code":""},{"path":"snippets.html","id":"snippets-reading","chapter":"C Snippets","heading":"C.1 Reading External Data","text":"","code":""},{"path":"snippets.html","id":"snippets-reading-excel","chapter":"C Snippets","heading":"C.1.1 Reading from Excel","text":"Background: Avoid Excel reasons previously discussed. isn’t another good option, protective. readxl::read_excel() allows specify column types, column order. names col_types ignored readxl::read_excel(). defend roaming columns (e.g., files changed time), tesit::assert() order expect.See readxl vignette, Cell Column Types, info.Last Modified: 2019-12-12 ","code":"\n# ---- declare-globals ---------------------------------------------------------\nconfig                         <- config::get()\n\n# cat(sprintf('  `%s`             = \"text\",\\n', colnames(ds)), sep=\"\") # 'text' by default --then change where appropriate.\ncol_types <- c(\n  `Med Rec Num`     = \"text\",\n  `Admit Date`      = \"date\",\n  `Tot Cash Pymt`   = \"numeric\"\n)\n\n# ---- load-data ---------------------------------------------------------------\nds <- readxl::read_excel(\n  path      = config$path_admission_charge,\n  col_types = col_types\n  # sheet   = \"dont-use-sheets-if-possible\"\n)\n\ntestit::assert(\n  \"The order of column names must match the expected list.\",\n  names(col_types) == colnames(ds)\n)\n\n# Alternatively, this provides more detailed error messages than `testit::assert()`\n# testthat::expect_equal(\n#   colnames(d),\n#   names(col_types),\n#   label = \"worksheet's column name (x)\",\n#   
expected.label = \"col_types' name (y)\"\n# )"},{"path":"snippets.html","id":"snippets-reading-trailing-comma","chapter":"C Snippets","heading":"C.1.2 Removing Trailing Comma from Header","text":"Background: Occasionally Meditech Extract extra comma end 1st line. subsequent line, readr::read_csv() appropriately throws new warning missing column. warning flood can mask real problems.Explanation: snippet (a) reads csv plain text, (b) removes final comma, (c) passes plain text readr::read_csv() convert data.frame.Instruction: Modify Dx50 Name name final (real) column.Real Example: truong-pharmacist-transition-1 (Accessible CDW members.)Last Modified: 2019-12-12 ","code":"\n# The next two lines remove the trailing comma at the end of the 1st line.\nraw_text  <- readr::read_file(path_in)\nraw_text  <- sub(\"^(.+Dx50 Name),\", \"\\\\1\", raw_text)\n\nds        <- readr::read_csv(raw_text, col_types=col_types)"},{"path":"snippets.html","id":"snippets-reading-vroom","chapter":"C Snippets","heading":"C.1.3 Removing Trailing Comma from Header","text":"Background: incoming data files large side comfortably accept readr, use vroom. two packages developed group might combined future.Explanation: snippet defines col_types list names mimic approach using readr. small differences readr approach:\n1. col_types list instead readr::cols_only object.\n1. call vroom::vroom() passes col_names = names(col_types) explicitly.\n1. data file contains columns don’t need, define col_types anyway; vroom needs know file structure ’s missing header row.Real Example: akande-medically-complex-1 (Accessible CDW members.) 
These files header variable names; first line file first data row.Last Modified: 2020-08-21 ","code":"\n# ---- declare-globals ---------------------------------------------------------\nconfig            <- config::get()\n\ncol_types <- list(\n  sak                      = vroom::col_integer(),  # \"system-assigned key\"\n  aid_category_id          = vroom::col_character(),\n  age                      = vroom::col_integer(),\n  service_date_first       = vroom::col_date(\"%m/%d/%Y\"),\n  service_date_lasst       = vroom::col_date(\"%m/%d/%Y\"),\n  claim_type               = vroom::col_character(),\n  provider_id              = vroom::col_character(),\n  provider_lat             = vroom::col_double(),\n  provider_long            = vroom::col_double(),\n  provider_zip             = vroom::col_character(),\n  cpt                      = vroom::col_integer(),\n  revenue_code             = vroom::col_integer(),\n  icd_code                 = vroom::col_character(),\n  icd_sequence             = vroom::col_integer(),\n  vocabulary_coarse_id     = vroom::col_integer()\n)\n\n# ---- load-data ---------------------------------------------------------------\nds <- vroom::vroom(\n  file      = config$path_ohca_patient,\n  delim     = \"\\t\",\n  col_names = names(col_types),\n  col_types = col_types\n)\n\nrm(col_types)"},{"path":"snippets.html","id":"snippets-row","chapter":"C Snippets","heading":"C.2 Row Operations","text":"frequently find mean sum across columns (within row).\n\nFinding mean across lot columnsHere several approaches finding mean across columns, without naming column. 
remarks:m1 & m2 sanity checks example.m1 clumsy 10+ items.m2 discouraged ’s brittle.\nchange column order alter calculation.\nprefer use grep() specify sequence items.Especially large datasets,\n’d lean towards m3 items reasonably complete \nm4 participants missing enough items summary score fishy.\napproaches , m4 m6 return mean participant completed 2 items.dplyr::rowwise() convenient, slow large datasets.need complex function ’s clumsy include directly mutate() statement,\nsee calculation m6 delegated external function, f6.technique behind nonmissing pretty cool,\ncan apply arbitrary function cell ’re summed/averaged.contrast f6(), applies entire (row-wise) data.frame.","code":"\n# Isolate the columns to average.  Remember the `grep()` approach w/ `colnames()`\ncolumns_to_average <- c(\"hp\", \"drat\", \"wt\")\n\nf6 <- function(x) {\n  # browser()\n  s <- sum(x, na.rm = TRUE)\n  n <- sum(!is.na(x))\n  \n  dplyr::if_else(\n    2L <= n,\n    s / n,\n    NA_real_\n  )\n}\n\nmtcars |>\n  dplyr::mutate(\n    m1 = (hp + drat + wt) / 3,\n    m2 =\n      rowMeans(\n        dplyr::across(hp:wt), # All columns between hp & wt.\n        na.rm = TRUE\n      ),\n    m3 =\n      rowMeans(\n        dplyr::across(!!columns_to_average),\n        na.rm = TRUE\n      ),\n    s4 = # Finding the sum (used by m4)\n      rowSums(\n        dplyr::across(!!columns_to_average),\n        na.rm = TRUE\n      ),\n    nonmissing =\n      rowSums(\n        dplyr::across(\n          !!columns_to_average,\n          .fns = \\(x) { !is.na(x) }\n        )\n      ),\n    m4 = \n      dplyr::if_else(\n        2 <= nonmissing,\n        s4 / nonmissing,\n        NA_real_\n      )\n  ) |>\n  dplyr::rowwise() |> # Required for `m5`\n  dplyr::mutate(\n    m5 = mean(dplyr::c_across(dplyr::all_of(columns_to_average))),\n  ) |>\n  dplyr::ungroup() |> # Clean up after rowwise()\n  dplyr::rowwise() |> # Required for `m6`\n  dplyr::mutate(\n    m6 = f6(dplyr::across(!!columns_to_average))\n  ) |>\n  
dplyr::ungroup() |>   # Clean up after rowwise()\n  dplyr::select(\n    hp,\n    drat,\n    wt,\n    m1,\n    m2,\n    m3,\n    s4,\n    nonmissing,\n    m4,\n    m5,\n    m6,\n  )"},{"path":"snippets.html","id":"snippets-grooming","chapter":"C Snippets","heading":"C.3 Grooming","text":"","code":""},{"path":"snippets.html","id":"snippets-grooming-two-year","chapter":"C Snippets","heading":"C.3.1 Correct for misinterpreted two-digit year","text":"Background: Sometimes Meditech dates specified like 1/6/54 instead 1/6/1954. readr::read_csv() choose year supposed ‘1954’ ‘2054’. human can use context guess birth date past (guesses 1954), readr can’t (guesses 2054). avoid problems, request dates ISO-8601 format.Explanation: Correct dplyr::mutate() clause; compare date value today. date today , use ; day future, subtract 100 years.Instruction: future dates loan payments, direction flip.Last Modified: 2019-12-12 ","code":"\nds |>\n  dplyr::mutate(\n    dob = dplyr::if_else(dob <= Sys.Date(), dob, dob - lubridate::years(100))\n  )"},{"path":"snippets.html","id":"snippets-identification","chapter":"C Snippets","heading":"C.4 Identification","text":"","code":""},{"path":"snippets.html","id":"snippets-identification-tags","chapter":"C Snippets","heading":"C.4.1 Generating “tags”","text":"Background: need generate unique identification values future people/clients/patients, described style guide.Explanation: snippet create 5-row csv random 7-character “tags” send research team collecting patients.Instruction: Set pt_count, tag_length, path_out, execute. 
Add rename columns appropriate domain (e.g., change “patient tag” “store tag”).Last Modified: 2019-12-30 WillThe resulting dataset look like , different randomly-generated tags.","code":"\npt_count    <- 5L   # The number of rows in the dataset.\ntag_length  <- 7L   # The number of characters in each tag.\npath_out    <- \"data-private/derived/pt-pool.csv\"\n\ndraw_tag <- function (tag_length = 4L, urn = c(0:9, letters)) {\n  paste(sample(urn, size = tag_length, replace = TRUE), collapse = \"\")\n}\n\nds_pt_pool <-\n  tibble::tibble(\n    pt_index    = seq_len(pt_count),\n    pt_tag      = vapply(rep(tag_length, pt_count), draw_tag, character(1)),\n    assigned    = FALSE,\n    name_last   = \"--\",\n    name_first  = \"--\"\n  )\n\nreadr::write_csv(ds_pt_pool, path_out)\n\n# A tibble: 5 x 5\n  pt_index pt_tag  assigned name_last name_first\n     <int> <chr>   <lgl>    <chr>     <chr>\n1        1 seikyfr FALSE    --        --\n2        2 voiix4l FALSE    --        --\n3        3 wosn4w2 FALSE    --        --\n4        4 jl0dg84 FALSE    --        --\n5        5 r5ei5ph FALSE    --        --"},{"path":"snippets.html","id":"snippets-correspondence","chapter":"C Snippets","heading":"C.5 Correspondence with Collaborators","text":"","code":""},{"path":"snippets.html","id":"snippets-correspondence-excel","chapter":"C Snippets","heading":"C.5.1 Excel files","text":"Receiving storing Excel files almost always avoided reasons explained letter.receive extracts Excel files frequently, following request ready email person sending us Excel files. Adapt bold values like “109.19” situation. familiar tools, suggest alternative saving file csv. presented Excel gotchas, almost everyone ‘aha’ moment recognizes problem. Unfortunately, everyone flexible software can adapt easily.[Start letter]Sorry tedious, please resend extract csv file? Please call questions.Excel helpful values, essentially corrupting . example, values like 109.19 interpreted number, character code (e.g., see cell L14). 
limitations finite precision, becomes 109.18999999999999773. can’t round , values column cast numbers, V55.0. Furthermore, “E”s codes incorrectly interpreted exponent operator (e.g., “4E5” converted 400,000).\nFinally, values like 001.0 converted number leading trailing zeros dropped (cells like “1” distinguishable “001.0”).Unfortunately problems exist Excel file . import columns text, values already corrupted state.Please compress/zip csv file large email. ’ve found Excel file typically 5-10 times larger compressed csv.much Excel interferes medical variables, ’re lucky. messed branches science much worse. Genomics using far late realized mistakes.happened? default, Excel popular spreadsheet applications convert gene symbols dates numbers. example, instead writing “Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase,” researchers dubbed gene MARCH1. Excel converts date—03/01/2016, say—’s probably majority spreadsheet users mean type cell. Similarly, gene identifiers like “2310009E13” converted exponential numbers (2.31E+19). cases, conversions strip valuable information genes question.[End letter]","code":""},{"path":"presentations.html","id":"presentations","chapter":"D Presentations","heading":"D Presentations","text":"collection presentations BBMC friends may help demonstrate concepts discussed previous chapters.","code":""},{"path":"presentations.html","id":"presentations-crdw","chapter":"D Presentations","heading":"D.1 CRDW","text":"prairie-outpost-public: Documentation starter files OUHSC’s Clinical Data Warehouse.OUHSC CDW","code":""},{"path":"presentations.html","id":"presentations-redcap","chapter":"D Presentations","heading":"D.2 REDCap","text":"Secure Medical Data Collection - Best Practices Excel, Leveling REDCap & CollaboratoR. R/Medicine 2021, Virtual. Accompanying vignette: Typical REDCap Workflow Data Analyst.Secure Medical Data Collection - Best Practices Excel, Leveling REDCap & CollaboratoR. R/Medicine 2021, Virtual. 
Accompanying vignette: Typical REDCap Workflow Data Analyst.REDCap Systems Integration. REDCap Con 2015, Portland, Oregon.Literate Programming Patterns Practices REDCap REDCap Con 2014, Park City, Utah.Interacting REDCap API using REDCapR Package REDCap Con 2014, Park City, Utah.Optimizing Study Management using REDCap, R, software tools. SCUG 2013.","code":""},{"path":"presentations.html","id":"presentations-reproducible","chapter":"D Presentations","heading":"D.3 Reproducible Research & Visualization","text":"Building pipelines dashboards practitioners: Mobilizing knowledge reproducible reporting. Displaying Health Data Colloquium 2018, University Victoria.Interactive reports webpages R & Shiny. SCUG 2015.Big data, big analysis: collaborative framework multistudy replication. Conventional Canadian Psychological Association, Victoria BC, 2016.WATS: wrap-around time series: Code accompany WATS Plot article, 2014.","code":""},{"path":"presentations.html","id":"presentations-data-management","chapter":"D Presentations","heading":"D.4 Data Management","text":"BBMC Validator: catch communicate data errors. SCUG 2016.Text manipulation Regular Expressions, Part 1 Part 2. SCUG 2016.Time Effort Data Synthesis. SCUG 2015.","code":""},{"path":"presentations.html","id":"presentations-github","chapter":"D Presentations","heading":"D.5 GitHub","text":"Scientific Collaboration GitHub. 
OU Bioinformatics Breakfast Club 2015.","code":""},{"path":"presentations.html","id":"presentations-software","chapter":"D Presentations","heading":"D.6 Software","text":"REDCapR: Interaction R REDCap.OuhscMunge: Data manipulation operations commonly used Biomedical Behavioral Methodology Core within Department Pediatrics University Oklahoma Health Sciences Center.codified: Produce standard/formalized demographics tables.usnavy billets: Optimally assigning naval officers billets.","code":""},{"path":"presentations.html","id":"presentations-architecture","chapter":"D Presentations","heading":"D.7 Architectures","text":"Linear Pipeline R Analysis Skeleton\nMany--many Pipeline R Analysis Skeleton\nImmunization transfer\nIALSA: Collaborative Modeling Framework Multi-study Replication\nPOPS: Automated daily screening eligibility rare understudied prescriptions.\n","code":""},{"path":"presentations.html","id":"presentations-components","chapter":"D Presentations","heading":"D.8 Components","text":"Customizing display tables: using css DT kableExtra. SCUG 2018.yaml expandable trees selectively show subsets hierarchy, 2017.","code":""},{"path":"scratch-pad.html","id":"scratch-pad","chapter":"E Scratch Pad of Loose Ideas","heading":"E Scratch Pad of Loose Ideas","text":"","code":""},{"path":"scratch-pad.html","id":"chapters-sections-to-form","chapter":"E Scratch Pad of Loose Ideas","heading":"E.1 Chapters & Sections to Form","text":"Tools Consider\ntidyverse\nodbc\nggplot2\nuse factors explanatory variables want keep order consistent across graphs. 
(genevamarshall)automation remote server VDI\n’s always chance machine configured little differently , may affect results. glance results ? forgot project , wouldn’t able spot problems like can. S drive file tables don’t seem obvious problemsautomation remote server VDIThere’s always chance machine configured little differently , may affect results. glance results ? forgot project , wouldn’t able spot problems like can. S drive file tables don’t seem obvious problemspublic reports (dashboards)\ndeveloping report external audience (ie, people outside immediate research team), choose one two pals unfamiliar aims/methods impromptu focus group. Ask things need redesigned/reframed/reformated/-explained. (genevamarshall)\nplots\nplot labels/axes\nvariable names\nunits measurement (eg, proportion vs percentage y axis)\n\npublic reports (dashboards)developing report external audience (ie, people outside immediate research team), choose one two pals unfamiliar aims/methods impromptu focus group. Ask things need redesigned/reframed/reformated/-explained. (genevamarshall)\nplots\nplot labels/axes\nvariable names\nunits measurement (eg, proportion vs percentage y axis)\nplotsplot labels/axesvariable namesunits measurement (eg, proportion vs percentage y axis)documentation - bookdown\n\nBookdown worked well us far. ’s basically independent markdown documents stored dedicated git repo. click “build” RStudio converts markdown files static html files. GitHub essentially serving backend, everyone can make changes sections don’t worried \n’s version ’s hosted publicly, tested can hosted shared file server. (’s possible html files static.) guys want OU’s collective CDW, please tell :\nwant able edit documents without review. ’ll add GitHub repo.\nwant able view documents. ’ll add dedicate file server space.\nhttps://ouhscbbmc.github.io/data-science-practices-1/workstation.html#installation-required\nthinking individual database gets chapter. 
BBMC ~4 databases sense: Centricity staging database, GECB staging database, central warehouse, (fledgling) downstream OMOP database. ~3 sections within chapter: () black--white description tables, columns, & indexes (written mostly consumers), (b) recommendations use table (written mostly consumers), (c) description ETL process (written mostly developers & admins).\nproposal uses GitHub Markdown ’re universal (knowledge R required –really write text editor & commit, let someone else click “build” RStudio machine). ’m flexible . ’ll support & contribute system guys feel work well across teams.\ndocumentation - bookdownBookdown worked well us far. ’s basically independent markdown documents stored dedicated git repo. click “build” RStudio converts markdown files static html files. GitHub essentially serving backend, everyone can make changes sections don’t worried aboutHere’s version ’s hosted publicly, tested can hosted shared file server. (’s possible html files static.) guys want OU’s collective CDW, please tell :want able edit documents without review. ’ll add GitHub repo.want able view documents. ’ll add dedicate file server space.https://ouhscbbmc.github.io/data-science-practices-1/workstation.html#installation-requiredI thinking individual database gets chapter. BBMC ~4 databases sense: Centricity staging database, GECB staging database, central warehouse, (fledgling) downstream OMOP database. ~3 sections within chapter: () black--white description tables, columns, & indexes (written mostly consumers), (b) recommendations use table (written mostly consumers), (c) description ETL process (written mostly developers & admins).proposal uses GitHub Markdown ’re universal (knowledge R required –really write text editor & commit, let someone else click “build” RStudio machine). ’m flexible . 
’ll support & contribute system guys feel work well across teams.developing packages\nR packages Hadley Wickham\nhttp://mangothecat.github.io/goodpractice/\ndeveloping packagesR packages Hadley WickhamR packages Hadley Wickhamhttp://mangothecat.github.io/goodpractice/http://mangothecat.github.io/goodpractice/Cargo cult programming “style computer programming characterized ritual inclusion code program structures serve real purpose.” (Wikipedia)\nteam decide elements file prototype repo prototype best .Cargo cult programming “style computer programming characterized ritual inclusion code program structures serve real purpose.” (Wikipedia)team decide elements file prototype repo prototype best .","code":""},{"path":"scratch-pad.html","id":"practices","chapter":"E Scratch Pad of Loose Ideas","heading":"E.2 Practices","text":".exit() add = TRUE (Wickham (2019), Exit handlers).","code":""},{"path":"scratch-pad.html","id":"good-sites","chapter":"E Scratch Pad of Loose Ideas","heading":"E.3 Good Sites","text":"Posts sites almost always worth time reading. frequently improve develop common components used data pipelines.Yihui Xie, created knitr important contributions reproducible research.RStudio, addition IDE, many packages used created developers.Explain xkcd ’s good.Occasionally skim titles sites pick relevant interests. 
think helps keep aware developments field, skills continually grow approaches don’t become stagnant.O’Reilly’s Data science ideas resourcesTowards Data ScienceThese books haven’t referenced (yet), good guidance worth time skimming see relevant.Tidynomicon Dhavide Aruliah & Greg WilsonThe Tidynomicon Dhavide Aruliah & Greg WilsonEfficient R programming Colin Gillespie & Robin LovelaceEfficient R programming Colin Gillespie & Robin LovelaceMastering Software Development RMastering Software Development R","code":""},{"path":"example-dashboard.html","id":"example-dashboard","chapter":"F Example Dashboard","heading":"F Example Dashboard","text":"Communicating quantitative trends community quantitative phobia can difficult. appendix showcases dashboard style evolved past years OSDH Home Visiting, twelve local programs practitioners implemented intervention ideas tailored interests community.50 dashboards developed: custom dashboard developed program’s cycle, three additional dashboards communicate results program-agnostic investigations. style guide important tool managing many unique investigationsFor program-specific dashboard, ’s important meet needs individual PDSA conform guide. However, aim make dashboards consistent possible several reasons:’s less work practitioners. familiar presentation help practitioners grow comfortable new cycle’s dashboard. Recall use least five dashboards years.’s less work analysts/developers. Within cycle, consistent format (relatively interchangeable features) means one analyst can easily contribute trouble shoot colleague’s dashboard.lessons ’ve learned (mistakes ’ve made) can applied later dashboards. quality improve development quicken.Just like CQI grant encourages HV program learn history learn others, analysts . 
work programs design PDSA, one analyst learn strengths weaknesses current dashboard style, propose improvements.","code":""},{"path":"example-dashboard.html","id":"example-dashboard-example","chapter":"F Example Dashboard","heading":"F.1 Example","text":"example dashboard mimic real CQI available https://ouhscbbmc.github.io/data-science-practices-1/dashboard-1.html. dashboard source code available analysis/dashboard-1 directory R Analysis Skeleton repository’; repo contains code documents entire pipeline leading dashboard.’ve success developing distributing dashboards self-contained html files. portable don’t dependencies local data files remote databases, yet JavaScript CSS provide modest amount interactivity. dashboard’s principal components flexdashboard, plotly, ggplot2, R Markdown.dashboard synthetic data, cognitive measure tracked across 14 years three home visiting counties.","code":""},{"path":"example-dashboard.html","id":"example-dashboard-guide","chapter":"F Example Dashboard","heading":"F.2 Style Guide","text":"section describes set practices BBMC analysts decided best CQI dashboards used MIECHV evaluations. sense, CQI dashboard guide supplements overall style guide.MIECHV CQI dashboards based RStudio’s flexdashboard package, uses rmarkdown, JavaScript, CSS. flexdashboard great website read anyone adapting guide CQI projects.","code":""},{"path":"example-dashboard.html","id":"headline-page","chapter":"F Example Dashboard","heading":"F.2.1 Headline page","text":"\ndashboard’s greeting good blend () orientating user context (b) welcoming overwhelming. second PDSA cycle, try one two important impactful graphs first page; specialized graphs pages later.Left column: Text qualified {.tabset}\nNotes tab: text provides info dashboard’s dataset, \nCount () models, (b) programs, (c) clients, (d) observations\nDate range\nspecific program_codes. 
Even though PDSA focused specific program, ideally programs included feel others .\n\nNotes tab: text provides info dashboard’s dataset, \nCount () models, (b) programs, (c) clients, (d) observations\nDate range\nspecific program_codes. Even though PDSA focused specific program, ideally programs included feel others .\nCount () models, (b) programs, (c) clients, (d) observationsDate rangeThe specific program_codes. Even though PDSA focused specific program, ideally programs included feel others .Right column: Headline Graph(s) optionally qualified {.tabset}.\nIdeally starts overall graph, longitudinal component.\nShow data program, overall model.\nIdeally starts overall graph, longitudinal component.Show data program, overall model.","code":""},{"path":"example-dashboard.html","id":"tables-page","chapter":"F Example Dashboard","heading":"F.2.2 Tables page","text":"\ntables provide exactness, especially exactness () actual y value (b) frequency longitudinal values. tables make easier see ’re inadvertently plotting multiple values month, month missing. future, can add ‘Download CSV’ button anyone requests .Another advantage tables measures visible screen. typical program-month table least 6 columns: program_code, month, model, outcome measure, process measure, disruptor measure. difficult , upstream scribe probably isn’t job well. tables almost untouched rds files created ‘load-data’ chunk.tab represent different unit analysis (e.g., single row summarizing completed visits program-month). Use tabs appropriate PDSA. 
Go biggest unit (e.g., model) smallest unit (e.g., Provider-Week).Unnamed column qualified {.tabset}.\nModel tab\nProgram tab\nProgram-Month tab\nProgram-Week tab\nProvider-Week tab\nSpaghetti Annotation tab spaghetti plots use faint vertical lines mark events (e.g., start PDSA intervention), include events .\nUnnamed column qualified {.tabset}.Model tabProgram tabProgram-Month tabProgram-Week tabProvider-Week tabSpaghetti Annotation tab spaghetti plots use faint vertical lines mark events (e.g., start PDSA intervention), include events .","code":""},{"path":"example-dashboard.html","id":"graphs-page","chapter":"F Example Dashboard","heading":"F.2.3 Graphs page","text":"\ngraphs plots provide user feel trends. One graph focuses one measure, ideally max three spaghetti plots. Ideally change time (PDSA’s program) compared programs period. PSDA multiple Process Measures, give separate tabs labeled ‘Process Measure 1’ & ‘Process Measure 2’.Unnamed column qualified {.tabset}.\nOutcome Measure tab\nProcess Measure tab\nDisruptor Measure tab\nUnnamed column qualified {.tabset}.Outcome Measure tabProcess Measure tabDisruptor Measure tabIf spaghetti plot depicts proportion/percentage measure, include visual layer count/denominator behind proportion (instead separate spaghetti plot dedicated denominator). may include one following:geom_point presence/absence denotes nonzero/zero denominatorgeom_point size denotes denominator’s size.geom_text (place geom_point) explicitly shows denominator’s sizegeom_text along bottom axis explicitly shows denominator’s sizeuse spaghetti_2() located display-1.R. (yet developed.) 
Add hover text spaghetti.","code":""},{"path":"example-dashboard.html","id":"marginal-graphs-page","chapter":"F Example Dashboard","heading":"F.2.4 Marginal Graphs page","text":"\nmarginal histograms provide context.Single column, qualified {.tabset}.Single column, qualified {.tabset}.Contains marginal/univariate graph variables analysis.\nMarginal graph outcome measure\nMarginal graph process measure\nMarginal graph disruptor measure\nContains marginal/univariate graph variables analysis.Marginal graph outcome measureMarginal graph process measureMarginal graph disruptor measureShow data program, overall model.Show data program, overall model.Use histogram_2() located display-1.R (link accessible Oklahoma’s MIECHV evaluation team). Add hover text histogram.Use histogram_2() located display-1.R (link accessible Oklahoma’s MIECHV evaluation team). Add hover text histogram.datasets unit analysis (e.g., ‘program-month’), don’t use H3 tab. Use (H3) tabs marginals one level (e.g., visit date program-month, visit date program-week, visit date provider-week). avoid multiple levels, possible; especially program isn’t fluent single level.datasets unit analysis (e.g., ‘program-month’), don’t use H3 tab. Use (H3) tabs marginals one level (e.g., visit date program-month, visit date program-week, visit date provider-week). avoid multiple levels, possible; especially program isn’t fluent single level.histograms specific y-axis. example, “Count Months” instead “Frequency”histograms specific y-axis. example, “Count Months” instead “Frequency”","code":""},{"path":"example-dashboard.html","id":"documentation-page","chapter":"F Example Dashboard","heading":"F.2.5 Documentation page","text":"\ndocumentation self-contained html file, ’s easier practitioner quickly get explanation return trends.Sometimes ’s best place explanation/annotation right next relevant content, times ’s distracting. ’s always work maintain explanations ’re spread-across interface. 
let’s try keeping almost everything one two tabs Documentation page.help beyond , let’s try reuse many documentation tabs possible. first tab specific methodology displays PDSA. remaining tabs reference common Rmd files; content automatically update dashboard rendered next.Unnamed column qualified {.tabset}.\nExplanation –Current PDSA\nExplanation –CQI Dashboards\nGlossary\nTips\nConfig\nUnnamed column qualified {.tabset}.Explanation –Current PDSAExplanation –CQI DashboardsGlossaryTipsConfig","code":""},{"path":"example-dashboard.html","id":"miscellaneous-notes","chapter":"F Example Dashboard","heading":"F.2.6 Miscellaneous Notes","text":"hierarchy level outline indicates HTML-heading level. Numbers H1 (.e., ======) specify pages, roman numerals H2 (.e., ------) specify columns, letters H3 (.e., ###) specify tabs.hierarchy level outline indicates HTML-heading level. Numbers H1 (.e., ======) specify pages, roman numerals H2 (.e., ------) specify columns, letters H3 (.e., ###) specify tabs.Cosmetics connote type dashboard. Specify using theme css yaml keywords Rmd header.\nCommon measures: theme: simplex uses red banner.\n1st cycle PDSAs (.e., initial cycle MIECHV 3): theme: cosmo uses blue banner. default used theme specified.\n2nd cycle PDSAs: theme: flatly uses turquoise banner.\n3rd cycle PDSAs: theme: journal uses light red banner.\n4th cycle PDSAs (.e., initial cycle MIECHV 5): custom css purple banner (public copy css available). Instead theme, line (four leading spaces, yaml entry nested output flexdashboard::flex_dashboard)\n    css: ../../common/style-cqi-cycle-4.css\nCosmetics connote type dashboard. Specify using theme css yaml keywords Rmd header.Common measures: theme: simplex uses red banner.Common measures: theme: simplex uses red banner.1st cycle PDSAs (.e., initial cycle MIECHV 3): theme: cosmo uses blue banner. default used theme specified.1st cycle PDSAs (.e., initial cycle MIECHV 3): theme: cosmo uses blue banner. 
default used theme specified.2nd cycle PDSAs: theme: flatly uses turquoise banner.2nd cycle PDSAs: theme: flatly uses turquoise banner.3rd cycle PDSAs: theme: journal uses light red banner.3rd cycle PDSAs: theme: journal uses light red banner.4th cycle PDSAs (.e., initial cycle MIECHV 5): custom css purple banner (public copy css available). Instead theme, line (four leading spaces, yaml entry nested output flexdashboard::flex_dashboard)\n    css: ../../common/style-cqi-cycle-4.css4th cycle PDSAs (.e., initial cycle MIECHV 5): custom css purple banner (public copy css available). Instead theme, line (four leading spaces, yaml entry nested output flexdashboard::flex_dashboard)","code":"    css: ../../common/style-cqi-cycle-4.css"},{"path":"example-dashboard.html","id":"example-dashboard-architecture","chapter":"F Example Dashboard","heading":"F.3 Architecture","text":"dashboard one piece large workflow. design construction workflow discussed book, highlighted .\n.\n","code":""},{"path":"example-dashboard.html","id":"data-from-external-system","chapter":"F Example Dashboard","heading":"F.3.1 Data from External System","text":"","code":""},{"path":"example-dashboard.html","id":"groomed-data-in-warehouse","chapter":"F Example Dashboard","heading":"F.3.2 Groomed Data in Warehouse","text":"","code":""},{"path":"example-dashboard.html","id":"analysis-ready-dataset","chapter":"F Example Dashboard","heading":"F.3.3 Analysis-Ready Dataset","text":"little data manipulation occur dashboard. upstream scribe produce analysis-ready rds file. dashboard concerned presenting graphs, tables, summary text, documentation.little data manipulation occur dashboard. upstream scribe produce analysis-ready rds file. dashboard concerned presenting graphs, tables, summary text, documentation.Include common measure PDSA explicitly mentions . Try show measures ’re directly related PDSA. PDSA dashboard less exposure change (makes easier maintain). 
program needs context measures, can look common measure dashboard.Include common measure PDSA explicitly mentions . Try show measures ’re directly related PDSA. PDSA dashboard less exposure change (makes easier maintain). program needs context measures, can look common measure dashboard.","code":""},{"path":"example-chapter.html","id":"example-chapter","chapter":"G Example Chapter","heading":"G Example Chapter","text":"intro copied 1st chapter example bookdown repo. ’m keeping temporarily reference.can label chapter section titles using {#label} , e.g., can reference Intro Chapter. manually label , automatic labels anywayFigures tables captions placed figure table environments, respectively.\nFigure G.1: nice figure!\nReference figure code chunk label fig: prefix, e.g., see Figure G.1. Similarly, can reference tables generated knitr::kable(), e.g., see Table G.1.Table G.1: nice table!can write citations, . example, using bookdown package (Xie 2023) sample book, built top R Markdown knitr (Xie 2015).","code":"\npar(mar = c(4, 4, .1, .1))\nplot(pressure, type = 'b', pch = 19)\nknitr::kable(\n  head(iris, 20), caption = 'Here is a nice table!',\n  booktabs = TRUE\n)"},{"path":"acknowledgements.html","id":"acknowledgements","chapter":"H Acknowledgements","heading":"H Acknowledgements","text":"authors thank colleagues discussions experiences data science lead book. OUHSC, includes\n@adrose,\n@aggie-dbc,\n@ARPeters,\n@Ashley-Jorgensen,\n@athumann,\n@atreat1,\n@caston60,\n@chanukyalakamsani,\n@CWilliamsOUHSC,\n@DavidBard,\n@evoss1,\n@genevamarshall,\n@Maleeha,\n@man9472,\n@rmatkins,\n@sbohora,\n@thomasnwilson,\n@vimleshbavadiya,\n@waleboro,\n@YuiYamaoka,\n@yutiantang.Outside OUHSC, includes@andkov,\n@ben519,\n@cscherrer,\n@cmodzelewski,\n@jimquallen,\n@mhunter1,\n@probinso,\n@russelljonas, \n@spopovych.`r (knitr::is_html_output()) ’","code":""},{"path":"references.html","id":"references","chapter":"I References","heading":"I References","text":"","code":""}]
diff --git a/docs/snippets.html b/docs/snippets.html
index b651a3d..d38969d 100644
--- a/docs/snippets.html
+++ b/docs/snippets.html
@@ -212,38 +212,136 @@ 

rm(col_types)

-
+

-C.2 Grooming +C.2 Row Operations

-
+

We frequently have to find the mean or sum across columns (within a row).
+Finding the mean across a lot of columns

+

Here are several approaches for finding the mean across columns, without naming each column. Some remarks:

+
    +
  • +m1 & m2 are sanity checks for this example.
    m1 would be clumsy if you have 10+ items.
    m2 is discouraged because it’s brittle.
    +A change in the column order could alter the calculation. +We prefer to use grep() to specify a sequence of items.
  • +
  • Especially for large datasets, +I’d lean towards m3 if the items are reasonably complete and +m4 if some participants are missing enough items that their summary score is fishy. +In the approaches below, m4 and m6 return the mean only if the participant completed 2 or more items.
  • +
  • +dplyr::rowwise() is convenient, but slow for large datasets.
  • +
  • If you need a more complex function that’s too clumsy to include directly in a mutate() statement, +see how the calculation for m6 is delegated to the external function, f6.
  • +
  • The technique behind nonmissing is pretty cool, +because you can apply an arbitrary function on each cell before they’re summed/averaged.
    +
  • +
  • This is in contrast to f6(), which applies to an entire (row-wise) data.frame.
  • +
+
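The `grep()` approach mentioned in the remarks can be sketched as follows. This is a minimal base-R illustration, not part of the original snippet: the dataset `ds`, the `item_*` column names, and the regular expression are hypothetical and should be adapted to your own instrument.

```r
# Hypothetical dataset; the `item_*` names are assumptions for illustration.
ds <- data.frame(
  id     = 1:3,
  item_1 = c(1, 2, NA),
  item_2 = c(4, NA, NA),
  item_3 = c(7, 8, 9),
  notes  = c("a", "b", "c")
)

# Select the items by name pattern, so the result doesn't depend on column order.
columns_to_average <- grep("^item_[0-9]+$", colnames(ds), value = TRUE)

# Average the selected columns within each row, ignoring missing cells.
item_mean <- rowMeans(ds[, columns_to_average], na.rm = TRUE)
```

Because the columns are chosen by name, reordering or inserting unrelated columns (like `notes`) should not change which items are averaged.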
+# Isolate the columns to average.  Remember the `grep()` approach w/ `colnames()`
+columns_to_average <- c("hp", "drat", "wt")
+
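+f6 <- function(x) {
+  # `x` is the one-row data.frame of items that rowwise() passes for each row.
+  # browser()
+  s <- sum(x, na.rm = TRUE)  # Sum of the nonmissing items
+  n <- sum(!is.na(x))        # Count of the nonmissing items
+  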
+  dplyr::if_else(
+    2L <= n,
+    s / n,
+    NA_real_
+  )
+}
+
+mtcars |>
+  dplyr::mutate(
+    m1 = (hp + drat + wt) / 3,
+    m2 =
+      rowMeans(
+        dplyr::across(hp:wt), # All columns between hp & wt.
+        na.rm = TRUE
+      ),
+    m3 =
+      rowMeans(
+        dplyr::across(!!columns_to_average),
+        na.rm = TRUE
+      ),
+    s4 = # Finding the sum (used by m4)
+      rowSums(
+        dplyr::across(!!columns_to_average),
+        na.rm = TRUE
+      ),
+    nonmissing =
+      rowSums(
+        dplyr::across(
+          !!columns_to_average,
+          .fns = \(x) { !is.na(x) }
+        )
+      ),
+    m4 = 
+      dplyr::if_else(
+        2 <= nonmissing,
+        s4 / nonmissing,
+        NA_real_
+      )
+  ) |>
+  dplyr::rowwise() |> # Required for `m5`
+  dplyr::mutate(
+    m5 = mean(dplyr::c_across(dplyr::all_of(columns_to_average))),
+  ) |>
+  dplyr::ungroup() |> # Clean up after rowwise()
+  dplyr::rowwise() |> # Required for `m6`
+  dplyr::mutate(
+    m6 = f6(dplyr::across(!!columns_to_average))
+  ) |>
+  dplyr::ungroup() |>   # Clean up after rowwise()
+  dplyr::select(
+    hp,
+    drat,
+    wt,
+    m1,
+    m2, 
+    m3, 
+    s4,
+    nonmissing,
+    m4,
+    m5, 
+    m6,
+  )
+
+
+
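Because `mtcars` has no missing values, the example above never exercises the "2 or more items" rule. The sketch below shows how `m3` and `m4` can diverge once cells are missing. It uses a hypothetical three-item dataset, and it assumes dplyr 1.1.0 or later, substituting `dplyr::pick()` with `dplyr::all_of()` for the `across(!!columns_to_average)` idiom used above.

```r
# Hypothetical dataset with missing cells; the item names are assumptions.
ds <- tibble::tibble(
  id     = 1:3,
  item_1 = c(1, 2, NA),
  item_2 = c(4, NA, NA),
  item_3 = c(7, 8, 9)
)
columns_to_average <- c("item_1", "item_2", "item_3")

ds_scored <- ds |>
  dplyr::mutate(
    # m3: mean of whatever items are present, even if only one.
    m3 = rowMeans(dplyr::pick(dplyr::all_of(columns_to_average)), na.rm = TRUE),
    # Count the nonmissing items in each row.
    nonmissing = rowSums(!is.na(dplyr::pick(dplyr::all_of(columns_to_average)))),
    # m4: keep the mean only if at least 2 items were completed.
    m4 = dplyr::if_else(2L <= nonmissing, m3, NA_real_)
  )
```

In this sketch the third participant completed a single item, so `m3` still reports that item's value while `m4` is `NA`.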

+C.3 Grooming +

+

-C.2.1 Correct for misinterpreted two-digit year +C.3.1 Correct for misinterpreted two-digit year

Background: Sometimes the Meditech dates are specified like 1/6/54 instead of 1/6/1954. readr::read_csv() has to choose if the year is supposed to be ‘1954’ or ‘2054’. A human can use context to guess a birth date is in the past (so it guesses 1954), but readr can’t (so it guesses 2054). To avoid this and other problems, request dates in an ISO-8601 format.

Explanation: Correct for this in a dplyr::mutate() clause; compare the date value against today. If the date is today or before, use it; if the date is in the future, subtract 100 years.

Instruction: For future dates such as loan payments, the direction will flip.

Last Modified: 2019-12-12 by Will

-
+
  ds |>
  dplyr::mutate(
     dob = dplyr::if_else(dob <= Sys.Date(), dob, dob - lubridate::years(100))
   )
-
+
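For the flipped direction mentioned in the instruction (dates that should be in the future, such as loan payments), a minimal sketch might look like the following. The column name `payment_due` and the fixed `reference_date` are hypothetical; the fixed date only keeps the example reproducible, and `Sys.Date()` would normally take its place.

```r
# Fixed for reproducibility in this sketch; use `Sys.Date()` in practice.
reference_date <- as.Date("2024-01-01")

# Hypothetical payments; suppose "1/6/70" was misread as 1970 instead of 2070.
ds <- tibble::tibble(
  payment_due = as.Date(c("1970-01-06", "2031-03-15"))
)

ds_corrected <- ds |>
  dplyr::mutate(
    payment_due = dplyr::if_else(
      reference_date <= payment_due,        # Already in the future: keep it.
      payment_due,
      payment_due + lubridate::years(100)   # In the past: shift forward a century.
    )
  )
```

One caveat: adding a lubridate period to February 29 yields `NA` when the target year is not a leap year, so it may be worth checking for new missing values after the correction.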

-C.3 Identification +C.4 Identification

-
+

-C.3.1 Generating “tags” +C.4.1 Generating “tags”

Background: Use this when you need to generate unique identification values for future people/clients/patients, as described in the style guide.

Explanation: This snippet will create a 5-row csv with random 7-character “tags” to send to the research team collecting patients. The

Instruction: Set pt_count, tag_length, path_out, and execute. Add and rename the columns to be more appropriate for your domain (e.g., change “patient tag” to “store tag”).

Last Modified: 2019-12-30 by Will

-
+
 pt_count    <- 5L   # The number of rows in the dataset.
 tag_length  <- 7L   # The number of characters in each tag.
 path_out    <- "data-private/derived/pt-pool.csv"
@@ -273,13 +371,13 @@ 

5 5 r5ei5ph FALSE -- --

-
+
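The full snippet is not visible in this diff, so the following is only a rough sketch of the idea, not the book's implementation. It draws random lowercase alphanumeric tags and redraws if any collide. The helper `draw_tags()` and the column names `pt_index` and `pt_tag` are hypothetical.

```r
pt_count   <- 5L   # The number of rows in the dataset.
tag_length <- 7L   # The number of characters in each tag.

# Hypothetical helper: draw `n` random tags of `n_char` characters each.
draw_tags <- function(n, n_char, pool = c(letters, 0:9)) {
  vapply(
    seq_len(n),
    \(i) paste(sample(pool, size = n_char, replace = TRUE), collapse = ""),
    character(1)
  )
}

# Redraw the whole set in the (unlikely) event of a duplicate tag.
tags <- draw_tags(pt_count, tag_length)
while (anyDuplicated(tags) > 0L) {
  tags <- draw_tags(pt_count, tag_length)
}

ds_pool <- data.frame(
  pt_index = seq_len(pt_count),
  pt_tag   = tags
)
```

Redrawing the whole set is a simple design choice that suits small pools; for a large pool, replacing only the colliding tags would likely be cheaper.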

-C.4 Correspondence with Collaborators +C.5 Correspondence with Collaborators

-
+

-C.4.1 Excel files +C.5.1 Excel files

Receiving and storing Excel files should almost always be avoided for the reasons explained in this letter.

We receive extracts as Excel files frequently, and have the following request ready to email the person sending us Excel files. Adapt the bold values like “109.19” to your situation. If you are familiar with their tools, suggest an alternative for saving the file as a csv. Once presented with these Excel gotchas, almost everyone has an ‘aha’ moment and recognizes the problem. Unfortunately, not everyone has flexible software and can adapt easily.

@@ -313,14 +411,15 @@

  • C.1.3 Removing Trailing Comma from Header
  • +
  • C.2 Row Operations
  • -C.2 Grooming +C.3 Grooming
  • -C.3 Identification +C.4 Identification
  • -C.4 Correspondence with Collaborators +C.5 Correspondence with Collaborators