Open Agenda

SCUG, March 2018

Actual Topics

  1. Python & R tradeoffs on the following dimensions

    • production system vs research
    • computer science background vs stats background
    • data manipulation vs analysis
    • propagation of ideas/manuscripts to external audiences
    • development costs
  2. knitr & automated reports

  3. GitHub

  4. benefits of promoting consistency of files/patterns across projects, and using skeletons (example).

  5. REDCap & research

Possible Topics (that weren't covered today)

  1. yaml & csv

    • flatten/denormalize list to data.frame example
  2. controlling long pipelines with flow files, such as osdh-flow.R

  3. config package

    • centralize your project-wide settings so they're available & consistent across multiple files.
    • similar to a project-wide 'declare-globals' chunk.
  4. text editors

    • my favorites: RStudio, Atom, and Notepad++.
    • find & replace across files with regexes: Atom
    • easy zooming in & out, which is especially nice when sharing screens: tie -- Atom & Notepad++
    • multicolumn select: 1st place--RStudio and 2nd place--Atom (with the Sublime-Style-Column-Selection package)
  5. tight text control

  6. Landing page for documentation across projects, such as BbmcResources

  7. writing style guides with your team

  8. Use skeleton repos to jumpstart your projects, such as RAnalysisSkeleton

  9. verify-values

# ---- verify-values -----------------------------------------------------------
# Sniff out problems. Each assertion stops with an error if the column breaks its rule.
# OuhscMunge::verify_value_headstart(ds)
checkmate::assert_integer(ds$county_month_id    , lower=          1L              , any.missing=FALSE, unique=TRUE )
checkmate::assert_integer(ds$county_id          , lower=          1L   , upper=77L, any.missing=FALSE, unique=FALSE)
checkmate::assert_date(   ds$month              , lower="2012-01-01"              , any.missing=FALSE)
checkmate::assert_integer(ds$region_id          , lower=          1L   , upper=20L, any.missing=FALSE)
checkmate::assert_numeric(ds$fte                , lower=          0    , upper=40 , any.missing=FALSE)
checkmate::assert_logical(ds$fte_approximated                                     , any.missing=FALSE)
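
To try the verify-values pattern without the real dataset, here is a minimal sketch. The `ds` below is hypothetical toy data (the column names follow the chunk above, the values are made up), and `ds_bad`, `county_ok`, and `county_error` are names introduced only for illustration. It assumes the checkmate package is installed.

```r
# Hypothetical toy data; only the column names come from the chunk above.
ds <- data.frame(
  county_month_id  = 1:3,
  county_id        = c(1L, 1L, 77L),
  month            = as.Date(c("2012-01-01", "2012-02-01", "2012-01-01")),
  fte              = c(0, 1.5, 40),
  fte_approximated = c(FALSE, FALSE, TRUE)
)

# On clean data the assertions pass silently.
checkmate::assert_integer(ds$county_month_id, lower = 1L, any.missing = FALSE, unique = TRUE)
checkmate::assert_integer(ds$county_id, lower = 1L, upper = 77L, any.missing = FALSE)
checkmate::assert_date(ds$month, lower = as.Date("2012-01-01"), any.missing = FALSE)
checkmate::assert_numeric(ds$fte, lower = 0, upper = 40, any.missing = FALSE)
checkmate::assert_logical(ds$fte_approximated, any.missing = FALSE)

# Introduce an out-of-range county and confirm the check catches it.
ds_bad <- ds
ds_bad$county_id[2] <- 78L

# test_*() returns TRUE/FALSE instead of stopping.
county_ok <- checkmate::test_integer(
  ds_bad$county_id, lower = 1L, upper = 77L, any.missing = FALSE
)

# assert_*() stops with an error; capture its message for inspection.
county_error <- tryCatch(
  checkmate::assert_integer(ds_bad$county_id, lower = 1L, upper = 77L, any.missing = FALSE),
  error = function(e) conditionMessage(e)
)
```

The `assert_*()` form suits a chunk that should halt the pipeline; the `test_*()` form is handy when you only want a flag to branch on.
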
  10. inequality joins with sqldf

    Bounded by another table, using a join

    d2 <- "
      SELECT
        o.[.record_matching_id],
        o.gender,
        o.age_months,
        o.bmi,
        p.percentile     AS percentile_lower,
        p.value
      FROM d_observed AS o
        LEFT OUTER JOIN d_pop_long AS p ON
          o.age_months = p.age_months AND
          o.gender     = p.gender     AND
          p.value      < o.bmi
      " %>%
      sqldf::sqldf(
        stringsAsFactors = FALSE
      )   

    Cumulative counts, by joining a table to itself

    ds_visit_cumulative_count <- "
      SELECT
        b.week, b.program_code, b.worker_name,
        COUNT(DISTINCT a.case_number) AS client_distinct_cumulative_by_worker
      FROM ds_visit_3 a
        JOIN ds_visit_3 b ON
          (a.week <= b.week)
          AND (a.program_code = b.program_code AND a.worker_name = b.worker_name)
      GROUP BY b.program_code, b.worker_name, b.week
      ORDER BY b.program_code, b.worker_name, b.week
    " %>%
      sqldf::sqldf()

    Windows of time, using a join

    ds_client_week_visit_goal <- "
      SELECT
        p.case_number,
        p.program_code,
        p.worker_name_last                AS worker_name,
        p.week_start_inclusive,
        --COUNT(v.visit_date)              AS visit_week_scheduled_count,
        SUM(v.visit_completed)           AS visit_week_completed_count
      FROM ds_possible_client_week p
        LEFT JOIN ds_visit v ON (
          p.case_number=v.case_number
          AND
          (p.week_start_inclusive <= v.visit_date AND v.visit_date<p.week_stop_exclusive)
        )
      GROUP BY p.case_number, p.week_start_inclusive
      ORDER BY p.case_number, p.week_start_inclusive
    " %>%
      sqldf::sqldf()
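
For the "yaml & csv" topic in the list above, flattening a nested list into a data.frame could look roughly like the sketch below. It uses base R only. The `counties` list is hypothetical, shaped like what a yaml reader might return for one entry per county, and `flatten_wide()` and `flatten_long()` are helper names made up for this illustration.

```r
# Hypothetical nested list (names and values are made up).
counties <- list(
  list(name = "Adair",   id = 1L, regions = c("east", "rural")),
  list(name = "Alfalfa", id = 2L, regions = c("north")),
  list(name = "Atoka",   id = 3L, regions = character(0))
)

# Wide form: one row per county, with the repeated field collapsed to one string.
flatten_wide <- function(x) {
  data.frame(
    name    = x$name,
    id      = x$id,
    regions = paste(x$regions, collapse = ";"),
    stringsAsFactors = FALSE
  )
}
ds_wide <- do.call(rbind, lapply(counties, flatten_wide))

# Long (denormalized) form: one row per county-region pair.
# Counties without any region keep a single row with NA.
flatten_long <- function(x) {
  regions <- if (length(x$regions) == 0L) NA_character_ else x$regions
  data.frame(
    name   = x$name,
    id     = x$id,
    region = regions,
    stringsAsFactors = FALSE
  )
}
ds_long <- do.call(rbind, lapply(counties, flatten_long))

# Either data.frame can then be saved with write.csv().
```

The wide form keeps one row per entity, which is easier to eyeball in a csv; the long form repeats the parent fields but is simpler to join and filter.
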