SCUG, March 2018
-
Python & R tradeoffs on the following dimensions
- production system vs research
- computer science background vs stats background
- data manipulation vs analysis
- propagation of ideas/manuscripts to external audiences
- development costs
-
knitr & automated reports
- some overlap with this 2013 presentation and this 2014 presentation.
-
GitHub
- some overlap with this 2014 presentation and this 2014 presentation.
-
benefits of promoting consistency of files/patterns across projects, and using skeletons (example).
-
REDCap & research
- creating REDCap projects
- token security
- REDCapR
- some overlap with this 2014 presentation.
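One way to approach the token security bullet above is to keep the token out of the source code entirely. This is a minimal base-R sketch, not REDCapR's own mechanism: the environment variable name `REDCAP_TOKEN_DEMO`, the helper `read_token()`, and the token value are all hypothetical, and the check assumes a token is 32 hexadecimal characters.

```r
# Hypothetical helper: read a token from an environment variable that is set
# outside the repository (e.g., in ~/.Renviron) and never committed.
read_token <- function(variable = "REDCAP_TOKEN_DEMO") {
  token <- Sys.getenv(variable, unset = NA_character_)
  if (is.na(token)) {
    stop("Environment variable `", variable, "` is not set.")
  }
  # Assumption: a token is 32 hexadecimal characters.
  if (!grepl("^[0-9A-Fa-f]{32}$", token)) {
    stop("The value of `", variable, "` does not look like a REDCap token.")
  }
  token
}

# Demonstrate with a made-up token; a real one would never appear in a script.
Sys.setenv(REDCAP_TOKEN_DEMO = "0123456789ABCDEF0123456789ABCDEF")
token <- read_token()
```

The returned value could then be passed as the `token` argument of `REDCapR::redcap_read()`, alongside the server's `redcap_uri`.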
-
yaml & csv
- flatten/denormalize list to data.frame example
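A rough sketch of the flatten/denormalize step, using only base R. The nested list is hand-built and hypothetical; it is shaped like what a YAML parser typically returns for a sequence of mappings, and the field names and values are made up for illustration.

```r
# Hypothetical nested list: one element per record, each a named list.
counties <- list(
  list(name = "Adair",   id = 1L, region = "east"),
  list(name = "Alfalfa", id = 2L, region = "west"),
  list(name = "Atoka",   id = 3L, region = "south")
)

# Turn each element into a one-row data.frame, then stack the rows.
ds <- do.call(
  rbind,
  lapply(counties, function(x) as.data.frame(x, stringsAsFactors = FALSE))
)

# Once the data are rectangular, saving as csv is straightforward.
path_out <- tempfile(fileext = ".csv")
write.csv(ds, path_out, row.names = FALSE)
```

This works only when every element has the same fields, each holding a single value; ragged or deeper nesting needs more care.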
-
controlling long pipelines with flow files, such as osdh-flow.R
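The idea behind a flow file can be sketched as follows. This is not the contents of osdh-flow.R; the step names are hypothetical, and temp files stand in for real project scripts so the sketch runs on its own.

```r
# Create three stand-in scripts; a real project would already have these.
dir_flow <- tempfile("flow-")
dir.create(dir_flow)
steps <- c("1-import.R", "2-groom.R", "3-report.R")
for (step in steps) {
  writeLines(
    sprintf("message('running %s')", step),
    file.path(dir_flow, step)
  )
}

# Run one script in its own environment, so steps don't leak objects
# into each other, and record how long it took.
run_step <- function(path) {
  started <- Sys.time()
  source(path, local = new.env())
  data.frame(
    path             = basename(path),
    seconds          = as.numeric(difftime(Sys.time(), started, units = "secs")),
    stringsAsFactors = FALSE
  )
}

# The flow itself: an ordered vector of scripts, executed top to bottom.
ds_log <- do.call(rbind, lapply(file.path(dir_flow, steps), run_step))
```

Keeping the order in one place means the whole pipeline can be rerun with a single call, and `ds_log` shows which step is slow.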
-
config package
- centralize your project-wide settings so they're available & consistent across multiple files.
- similar to a project-wide 'declare-globals' chunk.
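A minimal sketch of the pattern, assuming the config and yaml packages are installed. The setting names are hypothetical, and the file is written to a temp location only so the example is self-contained; a real project would keep a `config.yml` at its root.

```r
# Write a small config file with project-wide settings (hypothetical names).
path_config <- tempfile(fileext = ".yml")
writeLines(c(
  "default:",
  "  path_raw: \"data-raw/visits.csv\"",
  "  county_id_max: 77",
  "  month_min: \"2012-01-01\""
), path_config)

# Any script in the project reads the same values from the same file.
cfg <- config::get(file = path_config)
cfg$path_raw
cfg$county_id_max
```

Each script then refers to `cfg$county_id_max` instead of hard-coding the number, so a change is made once.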
-
text editors
- my favorites: RStudio, Atom, and Notepad++.
- find & replace across files with regexes: Atom
- easy zooming in & out, which is especially nice when sharing screens: tie -- Atom & Notepad++
- multicolumn select: 1st place--RStudio and 2nd place--Atom (with the Sublime-Style-Column-Selection package)
-
tight text control
- base::sprintf()
- glue::glue()
- & friends
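A short comparison of the two functions, assuming the glue package is installed. The variable names and values are made up for illustration.

```r
county <- "Adair"
fte    <- 3.14159
n      <- 7L

# sprintf(): format codes control precision and zero-padding.
s1 <- sprintf("%s county: %.2f FTE across %03d workers", county, fte, n)

# glue(): R expressions are evaluated inside the braces.
s2 <- glue::glue("{county} county: {round(fte, 2)} FTE across {n} workers")

# sprintf() is vectorized, which is handy for building padded ids.
ids <- sprintf("id-%02d", 1:3)
```

`sprintf()` gives finer control over width and padding; `glue()` tends to be easier to read when a string has many pieces.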
-
Landing page for documentation across projects, such as BbmcResources
-
writing style guides with your team
- project-specific, such as the dashboard example.
- external consumption, such as the REDCap API Troubleshooting Guide.
- language-specific, such as the tidyverse style guide for R, which derived from Google's Style Guide for R and Hadley's Style Guide for R (this one is probably more representative of what your team might produce to unify your projects)
-
Use skeleton repos to jumpstart your projects, such as RAnalysisSkeleton
-
verify-values
# ---- verify-values -----------------------------------------------------------
# Sniff out problems
# OuhscMunge::verify_value_headstart(ds)
checkmate::assert_integer(ds$county_month_id  , lower = 1L                   , any.missing = FALSE, unique = TRUE )
checkmate::assert_integer(ds$county_id        , lower = 1L, upper = 77L      , any.missing = FALSE, unique = FALSE)
checkmate::assert_date(   ds$month            , lower = as.Date("2012-01-01"), any.missing = FALSE)
checkmate::assert_integer(ds$region_id        , lower = 1L, upper = 20L      , any.missing = FALSE)
checkmate::assert_numeric(ds$fte              , lower = 0 , upper = 40       , any.missing = FALSE)
checkmate::assert_logical(ds$fte_approximated                                , any.missing = FALSE)
-
inequality joins with sqldf
Bounded by another table, using a join
d2 <- "
  SELECT
    o.[.record_matching_id],
    o.gender,
    o.age_months,
    o.bmi,
    p.percentile AS percentile_lower,
    p.value
  FROM d_observed AS o
    LEFT OUTER JOIN d_pop_long AS p ON
      o.age_months = p.age_months
      AND
      o.gender = p.gender
      AND
      p.value < o.bmi
" %>%
  sqldf::sqldf(stringsAsFactors = FALSE)
Cumulation, by restricting on itself
ds_visit_cumulative_count <- "
  SELECT
    b.week,
    b.program_code,
    b.worker_name,
    count(distinct a.case_number) as client_distinct_cumulative_by_worker
  FROM ds_visit_3 a
    JOIN ds_visit_3 b ON
      (a.week <= b.week)
      AND
      (a.program_code=b.program_code AND a.worker_name=b.worker_name)
  GROUP BY b.program_code, b.worker_name, b.week
  ORDER BY b.program_code, b.worker_name, b.week
" %>%
  sqldf::sqldf()
Windows of time, using a join
ds_client_week_visit_goal <- "
  SELECT
    p.case_number,
    p.program_code,
    p.worker_name_last AS worker_name,
    p.week_start_inclusive,
    --COUNT(v.visit_date) AS visit_week_scheduled_count,
    SUM(v.visit_completed) AS visit_week_completed_count
  FROM ds_possible_client_week p
    LEFT JOIN ds_visit v ON (
      p.case_number=v.case_number
      AND
      (p.week_start_inclusive <= v.visit_date AND v.visit_date<p.week_stop_exclusive)
    )
  GROUP BY p.case_number, p.week_start_inclusive
  ORDER BY p.case_number, p.week_start_inclusive
" %>%
  sqldf::sqldf()