From a3951ce6844aa8752abfcb65859a2f3e5d7ca1af Mon Sep 17 00:00:00 2001 From: Zihao Li Date: Mon, 19 Aug 2024 11:59:56 +0100 Subject: [PATCH 1/5] Development (#89) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Update to dev version * Make testthat run in parallel * Fix build tests (#63) * Update variables to pass tests * Update indiv number of variables * Change exists tests to read * Set an environment var to make testthat use multiple CPUs --------- Co-authored-by: James McMahon * Bug/tidyselect not working (#85) * Allow using tidyselect helpers with `col_select` * Add some tests for tidyselect helpers * Update documentation * recid and partnership filter allow for recid and partnership filter when they are not specified to select in columns * Style package * update tests --------- Co-authored-by: Moohan Co-authored-by: Zihao Li Co-authored-by: lizihao-anu * Documentation (#88) * Update maintainer to Megan (#69) * Update README.Rmd (#64) Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> * Bug - speed up `get_chi()` (#68) * Update to dev version * Make testthat run in parallel * Update variables to pass tests * Update indiv number of variables * Change exists tests to read * Set an environment var to make testthat use multiple CPUs * Revert changes and deal with NA chi/anon_chi * Update documentation * Style package * Update tests so that they pass * Style package * fix tests * Render `README.md` after changes to the `.Rmd` version * exclude from tests for now --------- Co-authored-by: James McMahon Co-authored-by: Jennit07 * Render `README.md` after changes to the `.Rmd` version (#70) Co-authored-by: github-merge-queue[bot] Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> * Bump actions/checkout from 3 to 4 (#66) Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](https://github.com/actions/checkout/compare/v3...v4) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> * Bump peter-evans/create-pull-request from 4 to 5 (#65) Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 4 to 5. - [Release notes](https://github.com/peter-evans/create-pull-request/releases) - [Commits](https://github.com/peter-evans/create-pull-request/compare/v4...v5) --- updated-dependencies: - dependency-name: peter-evans/create-pull-request dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> * Bump stefanzweifel/git-auto-commit-action from 4 to 5 (#67) Bumps [stefanzweifel/git-auto-commit-action](https://github.com/stefanzweifel/git-auto-commit-action) from 4 to 5. 
- [Release notes](https://github.com/stefanzweifel/git-auto-commit-action/releases) - [Changelog](https://github.com/stefanzweifel/git-auto-commit-action/blob/master/CHANGELOG.md) - [Commits](https://github.com/stefanzweifel/git-auto-commit-action/compare/v4...v5) --- updated-dependencies: - dependency-name: stefanzweifel/git-auto-commit-action dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> * Bump JamesIves/github-pages-deploy-action from 4.4.3 to 4.5.0 (#71) Bumps [JamesIves/github-pages-deploy-action](https://github.com/jamesives/github-pages-deploy-action) from 4.4.3 to 4.5.0. - [Release notes](https://github.com/jamesives/github-pages-deploy-action/releases) - [Commits](https://github.com/jamesives/github-pages-deploy-action/compare/v4.4.3...v4.5.0) --- updated-dependencies: - dependency-name: JamesIves/github-pages-deploy-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump actions/upload-artifact from 3 to 4 (#72) Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 3 to 4. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](https://github.com/actions/upload-artifact/compare/v3...v4) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> * Bump peter-evans/create-pull-request from 5 to 6 (#74) Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 5 to 6. - [Release notes](https://github.com/peter-evans/create-pull-request/releases) - [Commits](https://github.com/peter-evans/create-pull-request/compare/v5...v6) --- updated-dependencies: - dependency-name: peter-evans/create-pull-request dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump actions/cache from 3 to 4 (#73) Bumps [actions/cache](https://github.com/actions/cache) from 3 to 4. - [Release notes](https://github.com/actions/cache/releases) - [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md) - [Commits](https://github.com/actions/cache/compare/v3...v4) --- updated-dependencies: - dependency-name: actions/cache dependency-type: direct:production update-type: version-update:semver-major ... 
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> * Update README.md (#75) Updated to include reading in LTC 'catch all' variables * change in episode file cost variable vector (#76) Co-authored-by: marjom02 * force keytime format to hms (#77) * force keytime format to hms * Update documentation * visible binding for global variables like 'keytime1' * minor changes * fix keytime in column names * import hms --------- Co-authored-by: lizihao-anu * Bump JamesIves/github-pages-deploy-action from 4.5.0 to 4.6.0 (#79) Bumps [JamesIves/github-pages-deploy-action](https://github.com/jamesives/github-pages-deploy-action) from 4.5.0 to 4.6.0. - [Release notes](https://github.com/jamesives/github-pages-deploy-action/releases) - [Commits](https://github.com/jamesives/github-pages-deploy-action/compare/v4.5.0...v4.6.0) --- updated-dependencies: - dependency-name: JamesIves/github-pages-deploy-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * add vignette for SLFhelper documentation * Style package * Hide messages * remove conflict * Style package * Split up documentation into 3 vignettes * add a comparison table to show the efficiency improvement * Update - round memory size * replace columns by col_select and add tidyselect * Style package * update ep_file_vars and indiv_file_vars * add session memory recommendation * Update R-CMD-check.yaml * fix cmd build error --------- Signed-off-by: dependabot[bot] Co-authored-by: James McMahon Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> Co-authored-by: Jennit07 Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: github-merge-queue[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Megan McNicol <43570769+SwiftySalmon@users.noreply.github.com> Co-authored-by: marjom02 Co-authored-by: lizihao-anu Co-authored-by: Jennifer Thom * Update documentation (#90) Co-authored-by: github-merge-queue[bot] <118344674+github-merge-queue[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> * Bump JamesIves/github-pages-deploy-action from 4.6.0 to 4.6.3 (#82) Bumps [JamesIves/github-pages-deploy-action](https://github.com/jamesives/github-pages-deploy-action) from 4.6.0 to 4.6.3. - [Release notes](https://github.com/jamesives/github-pages-deploy-action/releases) - [Commits](https://github.com/jamesives/github-pages-deploy-action/compare/v4.6.0...v4.6.3) --- updated-dependencies: - dependency-name: JamesIves/github-pages-deploy-action dependency-type: direct:production update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> * Increment version number to 0.10.2 (#81) * Increment version number to 0.10.2 * update description * Update R-CMD-check.yaml * Update yaml --------- Co-authored-by: Zihao Li --------- Signed-off-by: dependabot[bot] Co-authored-by: James McMahon Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> Co-authored-by: Moohan Co-authored-by: lizihao-anu Co-authored-by: Jennit07 Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: github-merge-queue[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Megan McNicol <43570769+SwiftySalmon@users.noreply.github.com> Co-authored-by: marjom02 Co-authored-by: Jennifer Thom Co-authored-by: github-merge-queue[bot] <118344674+github-merge-queue[bot]@users.noreply.github.com> --- .github/workflows/R-CMD-check.yaml | 2 +- .github/workflows/pkgdown.yaml | 2 +- DESCRIPTION | 11 +- NEWS.md | 4 +- R/read_slf.R | 82 +++-- data/ep_file_vars.rda | Bin 1438 -> 4673 bytes data/indiv_file_vars.rda | Bin 1196 -> 4272 bytes man/ep_file_vars.Rd | 2 +- man/indiv_file_vars.Rd | 2 +- tests/testthat/test-multiple_years.R | 8 +- tests/testthat/test-read_slf_episode.R | 9 +- tests/testthat/test-read_slf_individual.R | 5 +- tests/testthat/test-tidyselect_columns.R | 55 +++ vignettes/slf-documentation.Rmd | 98 +++++ vignettes/slfhelper-applications.Rmd | 416 ++++++++++++++++++++++ vignettes/using-arrow-table.Rmd | 80 +++++ vignettes/variable-packs.Rmd | 29 +- 17 files changed, 753 insertions(+), 52 deletions(-) create mode 100644 tests/testthat/test-tidyselect_columns.R create mode 100644 vignettes/slf-documentation.Rmd create mode 100644 vignettes/slfhelper-applications.Rmd create mode 100644 vignettes/using-arrow-table.Rmd diff --git a/.github/workflows/R-CMD-check.yaml b/.github/workflows/R-CMD-check.yaml index 35aa114..ce2fba5 100644 --- a/.github/workflows/R-CMD-check.yaml +++ b/.github/workflows/R-CMD-check.yaml @@ -18,7 +18,7 @@ jobs: strategy: fail-fast: false matrix: - r_version: ['3.6.1', '4.0.2', '4.1.2', 'release', 'devel'] + r_version: ['4.0.2', '4.1.2', 'release'] env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} diff --git a/.github/workflows/pkgdown.yaml b/.github/workflows/pkgdown.yaml index 5bf3f5b..ef1fb74 100644 --- a/.github/workflows/pkgdown.yaml +++ b/.github/workflows/pkgdown.yaml @@ -41,7 +41,7 @@ jobs: - name: Deploy to GitHub pages 🚀 if: github.event_name != 'pull_request' - uses: JamesIves/github-pages-deploy-action@v4.6.0 + uses: JamesIves/github-pages-deploy-action@v4.6.3 with: clean: false branch: gh-pages diff --git a/DESCRIPTION b/DESCRIPTION index 662ea5c..1123852 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,12 +1,15 @@ Type: Package Package: slfhelper Title: Useful functions for working with the Source Linkage Files -Version: 0.10.0.9000 +Version: 0.10.2 Authors@R: c( person("Public Health Scotland", , , "phs.source@phs.scot", role = "cph"), person("James", "McMahon", , "james.mcmahon@phs.scot", role = c("aut"), comment = c(ORCID = "0000-0002-5380-2029")), - person("Megan", "McNicol", , "megan.mcnicol2@phs.scot", role = c("cre", "aut")) + person("Megan", "McNicol", , "megan.mcnicol2@phs.scot", role = c("cre", "aut")), + person("Zihao", "Li", , "zihao.li@phs.scot", role = c("aut"), + comment = c(ORCID = "0000-0002-5178-2124")), + 
person("Jennifer", "Thom", , "jennifer.thom@phs.scot", role = c("aut")) ) Description: This package provides helper functions for working with the Source Linkage Files (SLFs). The functions are mainly focused on @@ -17,7 +20,7 @@ URL: https://public-health-scotland.github.io/slfhelper/, https://github.com/Public-Health-Scotland/slfhelper BugReports: https://github.com/Public-Health-Scotland/slfhelper/issues Depends: - R (>= 3.5.0) + R (>= 4.0.0) Imports: arrow (>= 12.0.1), cli (>= 3.6.1), @@ -53,4 +56,4 @@ Language: en-GB LazyData: true Roxygen: list(markdown = TRUE, roclets = c("collate","namespace", "rd", "vignette" )) -RoxygenNote: 7.3.1 +RoxygenNote: 7.3.2 diff --git a/NEWS.md b/NEWS.md index 40f48e1..5677e65 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,4 +1,6 @@ -# slfhelper (development version) +# slfhelper 0.10.2 + +# slfhelper 0.10.1 # slfhelper 0.10.0 diff --git a/R/read_slf.R b/R/read_slf.R index aa5021c..62b5174 100644 --- a/R/read_slf.R +++ b/R/read_slf.R @@ -53,15 +53,19 @@ read_slf <- function( # but the column wasn't selected we need to add it (and remove later) remove_partnership_var <- FALSE remove_recid_var <- FALSE - if (!is.null(col_select)) { - if (!is.null(partnerships) & - !("hscp2018" %in% col_select)) { - col_select <- c(col_select, "hscp2018") + if (!rlang::quo_is_null(rlang::enquo(col_select))) { + if (!is.null(partnerships) && + stringr::str_detect(rlang::quo_text(rlang::enquo(col_select)), + stringr::coll("hscp2018"), + negate = TRUE + )) { remove_partnership_var <- TRUE } - if (!is.null(recids) & file_version == "episode" & - !("recid" %in% col_select)) { - col_select <- c(col_select, "recid") + if (!is.null(recids) && file_version == "episode" && + stringr::str_detect(rlang::quo_text(rlang::enquo(col_select)), + stringr::coll("recid"), + negate = TRUE + )) { remove_recid_var <- TRUE } } @@ -71,27 +75,48 @@ read_slf <- function( function(file_path) { slf_table <- arrow::read_parquet( file = file_path, - col_select = !!col_select, + col_select = {{ col_select }}, as_data_frame = FALSE ) - if (!is.null(recids)) { + if (!is.null(partnerships)) { + if (remove_partnership_var) { + slf_table <- cbind( + slf_table, + arrow::read_parquet( + file = file_path, + col_select = "hscp2018", + as_data_frame = FALSE + ) + ) + } slf_table <- dplyr::filter( slf_table, - .data$recid %in% recids + .data$hscp2018 %in% partnerships ) + if (remove_partnership_var) { + slf_table <- dplyr::select(slf_table, -"hscp2018") + } } - if (!is.null(partnerships)) { + + if (!is.null(recids)) { + if (remove_recid_var) { + slf_table <- cbind( + slf_table, + arrow::read_parquet( + file = file_path, + col_select = "recid", + as_data_frame = FALSE + ) + ) + } slf_table <- dplyr::filter( slf_table, - .data$hscp2018 %in% partnerships + .data$recid %in% recids ) - } - if (remove_partnership_var) { - slf_table <- dplyr::select(slf_table, -"hscp2018") - } - if (remove_recid_var) { - slf_table <- dplyr::select(slf_table, -"recid") + if (remove_recid_var) { + slf_table <- dplyr::select(slf_table, -"recid") + } } return(slf_table) @@ -146,15 +171,16 @@ read_slf_episode <- function( } # TODO add option to drop blank CHIs? 
# TODO add a filter by recid option - - data <- read_slf( - year = year, - col_select = unique(col_select), - file_version = "episode", - partnerships = unique(partnerships), - recids = unique(recids), - as_data_frame = as_data_frame, - dev = dev + return( + read_slf( + year = year, + col_select = {{ col_select }}, + file_version = "episode", + partnerships = unique(partnerships), + recids = unique(recids), + as_data_frame = as_data_frame, + dev = dev + ) ) if ("keytime1" %in% colnames(data)) { @@ -203,7 +229,7 @@ read_slf_individual <- function( return( read_slf( year = year, - col_select = unique(col_select), + col_select = {{ col_select }}, file_version = "individual", partnerships = unique(partnerships), as_data_frame = as_data_frame, diff --git a/data/ep_file_vars.rda b/data/ep_file_vars.rda index e0279d813e53b81338c7feebc899a6c97dedcc25..d0e91505cb7ac45734c66c27da933d72bcbd5a4f 100644 GIT binary patch literal 4673 zcmZ`-+iu)85Zz?g-uv1~T(>Cl3kB?Tnx^kXKcHxVyb)ZAyJ9Sf3Mtvx{(6C;XC#h9 z$?1b(ox|af9L|mV$8Vn>R-ads$z(d2y?Q;FUf^?b@zHrtAu!T^m z3Hb7)i$wD})5IRaz*L@i&gH2rom8x$w7ya{;nBBABAoW6ExbA4D4CK$a-AM3r)4aB zo#;%S1=hqI>1`22Z%Su!>CU1ZAB^34oVQ5+XLO~^7lEmd-63|w_)5A;kbS}m8)9kQ zh`T1Ploi=WL`PZ>Z_0#^ye2G@JhEe>4wUevX0e7#jE}h`O>GXv=F6;-mFZ4CixjM2 z%CEJ|Om$`x%F>C0P7- z05eO~1HvSc;4qd{_^estbso02_ji3F9#*=Q|&mF6<0cCvD~j-xrm>{^inx|mw1 zr$J_0;kt$rOJ&P}$TrcrE-Euva0?ec=t}!o`Bs;xQKd(A?(^wFr2}=lR3D|uY8WJv zn&$&eS0xg)DUkOE^v->2?iyJi)hSlDJ!IA!(c?8?bd zWV_hf2AZmDM{d&eG!|SDH;|h4UBiM3WK;q-a3}dUvd;eQp3~H2dH~|own)7yzkxX! zUMd-lwur~ByvdXO-Tv-rFPtnM!`W{ln(`0YuZ+{)?QraCZ4gD>=nv6oyUXG1A774- zGz=ud$%Z=K$2wSIq-9aNPl65>P6YgMgdawDALF^oYL9#vrmxZi^?*@TZi`MFUWfpN zrcSe4-$;cBL^C7W%-U3-pl=oxd_+;fJ=82m0uM4%G)gR9q1WPXnN-K{eF+-=642ac zNhFA}?-j@UyaAOBqKgLhQ>pfI>pQ0e2F&Z(a?eQ{V*8O8mv;<3L}CvkF|PR#`w)qJ z7>RKsaE-S8?p`Q#lY{1VEi($mjM~w#zm=~KIplg~d6Zce6C1CyV=7%92`!Y(^VE7S zUJGh9sq0}2QJ)!OYBGlyeL^Hp?PZLqw~R40mNBNTGRD+W#+dra7*jJDW9lR$JVmAE zt>bA_nY?=%Edu7N+DTMx=ggDL^8O*SL}K-}c8UXiLt;(C*E`}099$pjM;<^77EGVJj(MsSJHt(ASy3U$ zdPRj8OUU(%Dku4cYW-VOvcb)gDM3D_;DX z#Nk$Dh8CjGo)_Lwu;B@3qIb`{NPXme7sv;lM=3SFv1Qifk z30vu2scU+*qHA-N=H1>jcZAFeTMDaG?Od0-Zd>+XB_7&g3_Vvk{pOLvO(~qPo3XJL zT53FYd%uf0-ha&TDPoS#vL<~B*|cxN=eV!v3A$F9DNwW{YFk*i23l}~W3u&^xIJ`S zy&meQ>8_)Tz5$gX>%m5CHu3!nb*@rT$ea(q1|0naumXjj6ewG3?saiAC+A?tcpgjxhXnV4W&@Bsed=vQtD9TQ;KDuo1;}z3OzyqQG05;|w$pmkH3)7!D&)FyU(-BwEK^boB6rUcJB#M9!YbZ zzlV+&+)kB=`z!}?QrS3ao$P3=(rwL4moR0fY{fuFiYzD6(8R6|gF@KCMrm?dnF}z- z9X@&>`Y57reSWkpYcO)5A$VTFdf4~bHf1N?ohz(%KYSi`wW(R~O`5`w44g9Ud~xC9 z>5+xWwvw5q_6wF|y*Av0-^MDO01ChY5MiKH0$@-R;}rq`df^QC6FLMj=wPb+{Y*(Z zkqB$VHZx^ZR!~+ERuVTzBOqd0*8k$yBoLrVu4rIm46PwxF#XJH#`5Cg)&5I(%( z843^;Yj!f+wG5gwGxCHzF3vVIAgdil9R}7}_r{6T(k&8tgH~V>X;Mgt)KqEXC5Y2v zDLnFlsE`3F-1@T1Eah=M0lAkT`RmA;R_fVI9qj4~D#~HrMc8mj87S)mKylwuWm3bQ z5ISCyR1%TV3H!JlM#$YT;V3eixlAMUzmlzXly43>yPX4$7v3G#uM5XX7yxDdP#2qN z<}kLduJ=y|id>nTl%=Cbi0dav(*c<5GzXA1$`9*H2i)W9z>(&`R*wPl^R3EkNV!P~MnIqllv z{kt8NEX+^4GQG1n_x_Z|Rx}P=bQ`Qj`cpnh&P>4=odu^(I@qsUs1_?GV(BEUxS(S; zX`F2W*1o4}R%yKT*iC!VGubFqSxD{9%xj8CVK7iYb4>PT4|xX2qaf;bi8X_F(nQfG z!Op&VU{{x3(o(G8`C1k38v~u_7D!6vZ>bEJQLe&H-4iYQe(T zKu~Lc&P3p%U8rNcNJvPWtsahIt*%5G2xMkg5wHu2OJraaf&#|eY@(h#sV#_7Vrm4N z!5%grx-6Y;hZQ~Du1&2oH9`hWBnEW|Q7D_oCxr4S;zU`vcWy9_-vz92(sr?w=Gdn4 zMVy#~Hg6#$BtD7CC*K~LJUAu}wCwe3ShWCnL}t4LtGnC1#PjI@=U&s*)bRLKO`Cc9 z3E6C?b=d2-1|?|xG&^jcv2Qob4C#c9n1r@O=Gabk&U6vllQAp87)58`JQFB^kDc-m z)k$Oy15(_`1h5{eXi0{MuE#Ozh?}_Rd1TEq0~&Lo1%Kk^1Gl^-xR)^9;2wz(V%MxD sgZgGhk)YwEM^s-mdv0Y|hX_(TMvS*RXXoP2KYsh^rPDb-1LITYyfc9R zXR_pSM>aASWJk3D=nDS*>*+o%RY_PTy?wbAB^Nrdl`@toF1+34YTY_km8Q&FbHhkZ 
ztuU6DdR<6kq$)i(V;N!65K&13QpUprCjTNUlIeia+ zQZNuT?FfrXi=8MfUBlcd_gRJUr*P&r~F!Q zPTxZcUm9BENK zEQ%Y$%BJ3ZhasOEXCF$1m_^JaMZAmIcusRievs}J=Fo7XB1?1a0faoR!91DiT|^{_ z&Lc9;a|v9u$Vu$yl#j!mbQF$~eE{xgO%wSqr+nfeT!u$7XqfRfKk5Q~hasOEXP+$G zB4(2G?qW7h>^nzBlTV_TM-H*TUOf^(DSz*RkLI0>@Nm*89%Y~Akq{bbp(;{}H>OHM zsI}0kY027lHwD%?d&;-38H(X04XtM!oDtTe0x!iB%;Qn|(8VJ!XIE5mBCRNbe1G4I z4^4F7>`WD6=hGTGfbyMW;cmHf2oIOWKA_CZ>Sx^DmCf3_sptGF7H*&`XEL>eL?YQa z!VTOFrhp#a)~1cY%6CmjY3kfU&xbq6$|9sW;&3I!H(3uLv`WqcmdH|OLpjC_jz)sb zId{9p-L79##aqn9o8%g(lQCB<5eJMimdZ~7+TXmeBsDw6u76b@Br=!np53w z@(}=5CeG`G+OMRaj36Bs%igT3CK#vweB;#KRpN7Lc_j14zLRr}5ORS372#&;2Dd>CUD$yQJHrP_E{FmO*65fGOxA3UcDkxnpIT4$`TG!1W)y; z*b(##c#LDY$vRhxuQv*VW4H#OvAQ=*TgxezTX>5@5iol0O75$D*;LzGu(f@m1Krfv zTU{ux|9P>mtT)bGQT^|JsAp^YQU*Iz(c4`pp#M2&Yv+Np_J|D+6${R7u{?0u)X7`@ zP$F;le04%}TNn=vxuEergo!W#PuGy$%s>vB5@Vr!hy6S8Fz6q^ah7dU$DzX_uWrt0 zNH;_!P1es0tB^ymUwptNqlj&+z|D*- z`08=>xLXlT%U2&VM=*TYK|$aEcy1AMf7cQYQOhn7+oT;w-yiddLge9R7Hd3Ujo^`} z^lSY>+{Ap=VE>ZX&xyUlc8}$?vF%%9n62H%Zd7$G2`oeCe1pu}RVcdyvhcOsYpg)- zO|%Of3ud5auR>I$QdJG#)Q1o0sZk8tA$;RPmyw7437_6_C(A9|POx+Q_z$pfrmiYz z&SaxiL2CGbg~yQ#4n6(K`81kuO24-F4dC{@5=3p>3&DVPs3R~1Tyf#eFajICo%X8U VWyVwu_im{yc`emj34d((?|*NZQh5LX literal 1196 zcmV;d1XKG$T4*^jL0KkKS@fq!=Kup8|A_zp|NnpjeP9Iu7C^uM|F8rA00B@0-rUZ@ zXH0Kn$Bvrhpm%Xfy$!2AGh9 z6V#d;QwMr?x;k|TyR zcKCUC{`;H9%hDdZlLvQa8KgK2;;3YmOB*UNq_e$V^u$`HCNt^XHUPyLK3_iN(6Zn} zyCAMRao*;+LcBY<*;rkrh)kX`4yOk}To_RmQAHLtky@sydO>TNDF}u8?z+5F=Aubm zi6nmf)#Zfhs$#GD02^3nkIUt)2y;Ab=cTQy;p-1fvyaI6Zb!!juPi|^*(qvP(^ zNv=!v(r$cTQE^6jZ-G`pK)Zxu`knI@)5KSDY$D-Xi;__h5p0Ena@P(i-!Wc;^w}$m z2tn6&?6(qm(?<)z&_EC&1a(wRV!4rrr){y!`32aBR>YxIwwQ4Bb6wB2{9sGNS`kUN ze4~F$YoD?n{ZT!*;^)qSjKU0}3r=uw@f@U$49OCq3ihZgK!dp}0uVtrpvALIjVmA= zJJbe-YmRIKXt8+6!~KY*WhiKE->YWMUY{MF*H+wyko0jI=nEU#lmd&W(@hz;$_mn_ zo}QY;Y3u2p{x2oQK*r;k7lazi%Hdl?K{r=1?1*>RDGz=J^) zHM5K$1xN&-iHRt}KqP{V6rJlKNxDFx5$6GWHayxEu6>c5Hik;jg7ItXFwM`a4gz{W zO9Zg%!!#+-2$GS7N*Idrm}9l1g`uP)NSKs~!V)CON-$DkAjHhoNkAp&--jby3@(gC z% - dplyr::slice_sample(n = 100) + dplyr::group_by(year) %>% + dplyr::slice_sample(n = 50) %>% + dplyr::ungroup() # Test for anything odd expect_s3_class(indiv, "tbl_df") @@ -35,7 +37,9 @@ test_that("read multiple years works for episode file", { ep <- read_slf_episode(c("1718", "1819"), col_select = c("year", "anon_chi") ) %>% - dplyr::slice_sample(n = 100) + dplyr::group_by(year) %>% + dplyr::slice_sample(n = 50) %>% + dplyr::ungroup() # Test for anything odd expect_s3_class(ep, "tbl_df") diff --git a/tests/testthat/test-read_slf_episode.R b/tests/testthat/test-read_slf_episode.R index da40af1..82f27bd 100644 --- a/tests/testthat/test-read_slf_episode.R +++ b/tests/testthat/test-read_slf_episode.R @@ -28,9 +28,8 @@ for (year in years) { expect_equal(nrow(ep_file), 110) }) - # Need to come back to this test - some files have different lengths - # test_that("Episode file has the expected number of variables", { - # # Test for correct number of variables (will need updating) - # expect_length(ep_file, 241) - # }) + test_that("Episode file has the expected number of variables", { + # Test for correct number of variables (will need updating) + expect_length(ep_file, 251) + }) } diff --git a/tests/testthat/test-read_slf_individual.R b/tests/testthat/test-read_slf_individual.R index 2fe1f9a..be9341e 100644 --- a/tests/testthat/test-read_slf_individual.R +++ b/tests/testthat/test-read_slf_individual.R @@ -15,9 +15,8 @@ test_that("Reads individual file correctly", { # Test for the correct number of rows expect_equal(nrow(indiv_file), 100) - # Need to come back 
to this test - some files have different lengths - # # Test for correct number of variables (will need updating) - # expect_length(indiv_file, 184) + # Test for correct number of variables (will need updating) + expect_length(indiv_file, 193) } }) diff --git a/tests/testthat/test-tidyselect_columns.R b/tests/testthat/test-tidyselect_columns.R new file mode 100644 index 0000000..f85fdc7 --- /dev/null +++ b/tests/testthat/test-tidyselect_columns.R @@ -0,0 +1,55 @@ +skip_on_ci() + + +test_that("tidyselect helpers work for column selection in the episode file", { + expect_named( + read_slf_episode("1920", col_select = dplyr::starts_with("dd")), + c("dd_responsible_lca", "dd_quality") + ) + expect_named( + read_slf_episode("1920", col_select = c("year", dplyr::starts_with("dd"))), + c("year", "dd_responsible_lca", "dd_quality") + ) + expect_named( + read_slf_episode("1920", col_select = !dplyr::matches("[aeiou]")) + ) +}) + +test_that("col_select works when columns are added", { + expect_named( + read_slf_episode("1920", col_select = "year", recids = "DD"), + "year" + ) + expect_named( + read_slf_episode("1920", col_select = "year", partnerships = "S37000001"), + "year" + ) + expect_named( + read_slf_episode( + "1920", + col_select = c("year", dplyr::contains("dd")), + recids = "DD" + ) + ) + expect_named( + read_slf_episode( + "1920", + col_select = c("year", dplyr::contains("cij")), + partnerships = "S37000001" + ) + ) +}) + +test_that("tidyselect helpers work for column selection in the individual file", { + expect_named( + read_slf_individual("1920", col_select = dplyr::starts_with("dd")), + c("dd_noncode9_episodes", "dd_noncode9_beddays", "dd_code9_episodes", "dd_code9_beddays") + ) + expect_named( + read_slf_individual("1920", col_select = c("year", dplyr::starts_with("dd"))), + c("year", "dd_noncode9_episodes", "dd_noncode9_beddays", "dd_code9_episodes", "dd_code9_beddays") + ) + expect_named( + read_slf_individual("1920", col_select = !dplyr::matches("[aeiou]")) + ) +}) diff --git a/vignettes/slf-documentation.Rmd b/vignettes/slf-documentation.Rmd new file mode 100644 index 0000000..fcebbbc --- /dev/null +++ b/vignettes/slf-documentation.Rmd @@ -0,0 +1,98 @@ +--- +title: "slf-documentation" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{slf-documentation} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +```{r setup, include = FALSE} +library(slfhelper) +``` + +## SLFhelper + +`SLFhelper` contains some easy to use functions designed to make working with the Source Linkage Files (SLFs) as efficient as possible. + +### Filter functions: + +- `year` returns financial year of interest. You can also select multiple years using `c("1718", "1819", "1920")` +- `recid` returns recids of interest. Selecting this is beneficial for specific analysis. +- `partnerships` returns partnerships of interest. Selecting certain partnerships will reduce the SLFs size. +- `col_select` returns columns of interest. This is the best way to reduce the SLFs size. + +### Data snippets: + +- `ep_file_vars` returns a list of all variables in the episode files. +- `indiv_file_vars` returns a list of all variables in the individual files. +- `partnerships` returns a list of partnership names (HSCP_2018 codes) +- `recid` returns a list of all recids available in the SLFs. +- `ep_file_bedday_vars` returns a list of all bedday related variables in the SLFs. 
+- `ep_file_cost_vars` returns a list of all cost related variables in the SLFs. + +### Anon CHI + +- Use the function `get_chi()` to easily switch `anon_chi` to `chi`. +- Use the function `get_anon_chi()` to easily switch `chi` to `anon_chi`. + +### Memory usage in SLFs + +While working with the Source Linkage Files (SLFs), it is recommended to use the features of the SLFhelper package to make the best use of the memory available in Posit; see the [PHS Data Science Knowledge Base](https://public-health-scotland.github.io/knowledge-base/docs/Posit%20Infrastructure?doc=Memory%20Usage%20in%20SMR01.md) for further guidance on memory usage in Posit Workbench. + +Reading a full SLF file can be time-consuming and take up resources on Posit Workbench. In the episode file there are `r length(slfhelper::ep_file_vars)` variables and around 12 million rows, compared to the individual file where there are `r length(slfhelper::indiv_file_vars)` variables and around 6 million rows in each file. This can be reduced by using the selection arguments in SLFhelper to cut down the size of the SLFs for analysis and to free up resources in Posit Workbench. + +The tables below show the memory usage of each full-size SLF. + +#### Episode File + +| Year | Memory Usage (GiB) | |------|:------------------:| | 1718 | 22 | | 1819 | 22 | | 1920 | 22 | | 2021 | 19 | | 2122 | 21 | | 2223 | 21 | | 2324 | 18 | + +#### Individual File + +| Year | Memory Usage (GiB) | |------|:------------------:| | 1718 | 6.8 | | 1819 | 6.8 | | 1920 | 7.0 | | 2021 | 7.0 | | 2122 | 7.0 | | 2223 | 7.1 | | 2324 | 5.1 | + +Using the selection features in SLFhelper reduces the session memory requirement. A single-year episode file has `r length(slfhelper::ep_file_vars)` columns and is around 20 GiB in size, so, on average, a column with all rows takes around 0.1 GiB, which gives a rough estimate of the session memory needed. Taking year 1920 as a demonstration, the following tables present various sizes of extracts from the SLF files, from 5 columns to all columns, along with the amount of memory required to work with the data read in. Keep in mind that the tables below are just recommendations, and that memory usage depends on how one handles data and optimises the data pipeline. 
+ + +#### Episode File +| Column Number | Memory usage (GiB) | Session Memory Recommendation | +|---------------|:------------------:|---------------------------------------------------| +| 5 | 0.5 | 4 GiB (4096 MiB) | +| 10 | 1.4 | between 4 GiB (4096 MiB) and 8 GiB (8192 MiB) | +| 50 | 5.1 | between 8 GiB (8192 MiB) and 16 GiB (16384 MiB) | +| 150 | 13 | between 20 GiB (20480 MiB) and 38 GiB (38912 MiB) | +| 251 | 22 | between 32 GiB (32768 MiB) and 64 GiB (65536 MiB) | + +#### Individual File + +| Column Number | Memory usage (GiB) | Session Memory Recommendation | +|---------------|:------------------:|---------------------------------------------------| +| 5 | 0.7 | 4 GiB (4096 MiB) | +| 10 | 0.8 | 4 GiB (4096 MiB) | +| 50 | 2.2 | between 4 GiB (4096 MiB) and 8 GiB (8192 MiB) | +| 150 | 5.5 | between 8 GiB (8192 MiB) and 16 GiB (16384 MiB) | +| 193 | 7.0 | between 11 GiB (11264 MiB) and 21 GiB (21504 MiB) | diff --git a/vignettes/slfhelper-applications.Rmd b/vignettes/slfhelper-applications.Rmd new file mode 100644 index 0000000..4b4c3c4 --- /dev/null +++ b/vignettes/slfhelper-applications.Rmd @@ -0,0 +1,416 @@ +--- +title: "slfhelper-applications" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{slfhelper-applications} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +```{r setup} +library(slfhelper) +``` + +## Examples using SLFhelper + +1. A&E attendances in East Lothian by age group. + +Produce a table to compare A&E Attendances for the following age groups (0-17, 18-64, 65-74, 75-84, 85+) for 2018/19 in East Lothian HSCP. +```{r chunk2, eval=FALSE, message=FALSE} +# read in data required from slf individual file - filter for year 2018/19 +el_1819 <- read_slf_individual( + year = "1819", + # select variables needed + col_select = c("age", "ae_attendances"), + # filter partnership for East Lothian + partnerships = "S37000010" +) + +# create age bands +age_labs <- c("0-17", "18-64", "65-74", "75-84", "85+") # create age labels + +# create age group variable +el_1819 <- el_1819 %>% + mutate(age_group = cut(age, + breaks = c(-1, 17, 64, 74, 84, 150), labels = age_labs + )) + +# produce summary table +output_table_1 <- el_1819 %>% + group_by(age_group) %>% + summarise(attendances = sum(ae_attendances)) %>% + ungroup() +``` + + +2. Outpatient attendances by specialty and gender. + +Create a table to compare the number of outpatient attendances (SMR00) broken down by specialty and gender in 2017/18 in Scotland. 
+ +```{r chunk3, eval=FALSE, message=FALSE} +# read in specialty lookup with names +spec_lookup <- + read_csv("/conf/linkage/output/lookups/Unicode/National Reference Files/Specialty.csv") %>% + select( + spec = Speccode, + spec_name = Description + ) + +# read in data required from slf episode file - filter year = 2017/18 +op_1718 <- read_slf_episode( + year = "1718", + # select columns + col_select = c("recid", "gender", "spec"), + # filter on recid for outpatients + recids = "00B" +) + +# produce output +output_table_2 <- op_1718 %>% + # get counts by specialty and gender + count(spec, gender) %>% + # exclude those with no gender recorded + filter(gender == 1 | gender == 2) %>% + # recode gender into M/F + mutate(gender = recode(as.character(gender), "1" = "Male", "2" = "Female")) %>% + # move gender to separate columns + pivot_wider(names_from = gender, values_from = n) %>% + # match on specialty names + left_join(spec_lookup) %>% + # reorder variables + select(spec, spec_name, Male, Female) +``` + +3. Hospital admissions & beddays by HB of residence. + +Produce a table to compare the number of admissions, bed days and average length of stay (split into elective and non-elective) by Health Board of Residence in 2018/19. +```{r chunk4, eval=FALSE, message=FALSE} +# Read in names for Health Boards +hb_lookup <- + read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Health Board Area 2019 Lookup.csv") %>% + select( + hb2019 = HealthBoardArea2019Code, + hb_desc = HealthBoardArea2019Name + ) + +# read in data required from slf individual file - filter for 2018/19 +indiv_1819 <- read_slf_individual( + year = "1819", + # Select columns of interest + col_select = c( + "hb2019", "cij_el", "cij_non_el", + "acute_el_inpatient_beddays", + "mh_el_inpatient_beddays", + "gls_el_inpatient_beddays", + "acute_non_el_inpatient_beddays", + "mh_non_el_inpatient_beddays", + "gls_non_el_inpatient_beddays" + ) +) + + +# calculate total bed days and add on HB names +indiv_1819_inc_totals <- indiv_1819 %>% + # calculate overall bed days + mutate( + elective_beddays = acute_el_inpatient_beddays + mh_el_inpatient_beddays + + gls_el_inpatient_beddays, + non_elective_beddays = acute_non_el_inpatient_beddays + mh_non_el_inpatient_beddays + + gls_non_el_inpatient_beddays + ) %>% + # match on HB name + left_join(hb_lookup) + +# produce summary table +output_table_3 <- indiv_1819_inc_totals %>% + # group by HB of residence + group_by(hb2019, hb_desc) %>% + # produce summary table + summarise( + elective_adm = sum(cij_el), + non_elective_adm = sum(cij_non_el), + elective_beddays = sum(elective_beddays), + non_elective_beddays = sum(non_elective_beddays) + ) %>% + # calculate average length of stay + mutate( + elective_alos = elective_beddays / elective_adm, + non_elective_alos = non_elective_beddays / non_elective_adm + ) +``` + +4. GP Out of Hours Consultations in South Ayrshire. + +Create a table showing the number of GP Out of Hours consultations for patients with dementia in South Ayrshire HSCP in 2019/20 broken down by type of consultation. 
+```{r chunk5, eval=FALSE, message=FALSE} +# read in data required from slf episode file - filter for year = 2019/20 +sa_1920 <- read_slf_episode( + year = "1920", + # select columns + col_select = c("dementia", "smrtype"), + # filter for South Ayrshire HSCP + partnerships = "S37000027", + # Filter for GP OOH data + recids = "OoH" +) + +# select dementia patients +sa_dementia_1920 <- sa_1920 %>% + filter(dementia == 1) + +# produce summary table +output_table_4 <- sa_dementia_1920 %>% + count(smrtype) +``` + +5. Costs in Aberdeen City. + +Produce a table to show the number of patients and the total costs for Aberdeen City HSCP in 2018/19. Include a breakdown of costs for the following services: Acute (inpatients & daycases), GLS, Mental Health and Maternity, Outpatients, A&E, GP Out of Hours, Community Prescribing. +```{r chunk6, eval=FALSE, message=FALSE} +# read in data required from slf individual file - filter year = 2018/19 +ab_1819 <- read_slf_individual( + year = "1819", + # select columns + col_select = c( + "acute_cost", "gls_cost", "mh_cost", "mat_cost", + "op_cost_attend", "ae_cost", "ooh_cost", "pis_cost", + "health_net_cost" + ), + # filter for Aberdeen City + partnerships = "S37000001" +) + +# Have used variables which exclude the cost of outpatient appointments which were +# not attended (DNA) but you could also include this if needed. + +# produce summary table +output_table_5 <- ab_1819 %>% + # rename outpatients variable + rename(op_cost = op_cost_attend) %>% + # sum of all cost variables and number of patients + summarise(across(ends_with("_cost"), ~ sum(.x, na.rm = TRUE)), + patients = n() + ) %>% + # switch to rows + pivot_longer(everything()) +``` + +6. Deaths from Dementia / Alzheimers + +Produce a chart to show the number of deaths from 2015/16 to 2019/20 in Scotland where the main cause of death was recorded as Dementia/Alzheimers (ICD 10 codes: G30, F01-F03, F05.1). + +```{r chunk7, eval=FALSE, message=FALSE} +# read in data required from slf episode file - filter for years 2015/16 to 2019/20 +deaths <- read_slf_episode( + year = c("1516", "1617", "1718", "1819", "1920"), + # select columns + col_select = c("year", "deathdiag1"), + # Filter for death records + recids = "NRS" +) + +# extract 3 & 4 digit codes and select those with dementia +dementia_deaths <- deaths %>% + # extract 3 & 4 digit ICD 10 codes + mutate( + diag_3d = str_sub(deathdiag1, 1, 3), + diag_4d = str_sub(deathdiag1, 1, 4) + ) %>% + # select dementia codes + filter(diag_3d == "G30" | diag_3d == "F00" | diag_3d == "F01" | + diag_3d == "F02" | diag_3d == "F03" | diag_4d == "F051") + +# produce summary table +output_table_6 <- dementia_deaths %>% + count(year) %>% + rename(deaths = n) +``` + +7. Number and cost of prescriptions for MS + +Create a table to compare the number and cost of prescribed items for patients with Multiple Sclerosis (MS) by HSCP in 2018/19. Include the number of dispensed items and cost per patient. 
+ +```{r chunk8, eval=FALSE, message=FALSE} +# read in HSCP names (used in examples 7 & 9) +hscp_lookup <- read_csv("/conf/linkage/output/lookups/Unicode/Geography/Scottish Postcode Directory/Codes and Names/Integration Authority 2019 Lookup.csv") %>% + select( + hscp2019 = IntegrationAuthority2019Code, + hscp_desc = IntegrationAuthority2019Name + ) + +# read in data required from slf individual file - filter for year = 2018/19 +pis_1819 <- read_slf_individual("1819", + col_select = c("hscp2019", "ms", "pis_paid_items", "pis_cost") +) + + +# select all patients with MS & add on HSCP name +ms_1819 <- pis_1819 %>% + filter(ms == 1) %>% + left_join(hscp_lookup) + +# produce summary table +output_table_7 <- ms_1819 %>% + # group by hscp + group_by(hscp2019, hscp_desc) %>% + # sum up number of items, costs & patients with MS (not all will have had a prescription) + summarise( + pis_paid_items = sum(pis_paid_items), + pis_cost = sum(pis_cost), + patients = sum(ms) + ) %>% + ungroup() %>% + # calculate number of items / cost per patient + mutate( + items_per_patient = pis_paid_items / patients, + cost_per_patient = pis_cost / patients + ) +``` + +8. A&E attendance in last 3 months of life. + +Produce a table to show the number of deaths in Glasgow City HSCP in 2019/20 and what proportion had an A&E attendance in the last 3 months of life. + +```{r chunk9, eval=FALSE, message=FALSE} +# extract all deaths in Glasgow City in 1920 - Filter year = 1920 +gc_deaths <- read_slf_episode( + year = "1920", + # select columns + col_select = c("anon_chi", "death_date"), + # filter for Glasgow City + partnerships = "S37000015", + # Filter for death records + recids = "NRS" +) %>% + # exclude those with missing chi + filter(anon_chi != "") %>% + # exclude duplicates + distinct(anon_chi, death_date) + +# extract all A&E attendances in 1819 & 1920 +ae <- read_slf_episode( + year = c("1819", "1920"), + # select columns + col_select = c("anon_chi", "recid", "record_keydate1"), + # filter for A&E data + recids = "AE2" +) %>% + # exclude those with missing chi + filter(anon_chi != "") %>% + # rename date of attendance + rename(attendance_date = record_keydate1) + +# select A&E attendances for those individuals who are in the GC deaths file +ae_gc <- ae %>% + # filter A&E attendances for those in deaths file + semi_join(gc_deaths) %>% + # match on date of death + left_join(gc_deaths) + +# select A&E attendances which are within 3 months of death (counted as 91 days) +ae_gc_3m <- ae_gc %>% + # create 3 month interval + mutate(int_3m = interval(death_date - days(91), death_date)) %>% + # flag if attendance is in 3 month interval + mutate(att_3m = if_else(attendance_date %within% int_3m, 1, 0)) %>% + # select only those attendances in 3 months before death + filter(att_3m == 1) + +# create list of patients with A&E attendance in 3m period +pats_ae_3m <- ae_gc_3m %>% + # select only chi and attendance flag + select(anon_chi, att_3m) %>% + # restrict to one row per person + distinct() + +# final output for total number of deaths and number with an A&E attendance in last 3 months +output_table_8 <- gc_deaths %>% + # match on attendance flag + left_join(pats_ae_3m) %>% + # summarise total deaths and deaths with A&E attendance in last 3 months + summarise( + deaths = n(), + deaths_with_ae_att = sum(att_3m, na.rm = TRUE) + ) %>% + # calculate % + mutate(prop_ae_3m = deaths_with_ae_att / deaths) +``` + +9. Non-elective admissions in Geriatric Medicine. 
+ +Create a table showing the number of non-elective admissions with any part of the stay (Continuous Inpatient Journey, CIJ) in the specialty Geriatric Medicine, by HSCP in 2019/20. Also include the associated bed days, cost and number of patients. + +```{r chunk10, eval=FALSE, message=FALSE} +# extract data required from episode file +smr_1920 <- read_slf_episode( + year = "1920", + col_select = c( + "anon_chi", "record_keydate1", "record_keydate2", + "spec", "hscp2019", "yearstay", "cost_total_net", + "cij_marker", "cij_pattype" + ), + recids = c("01B", "GLS", "04B") +) %>% + # exclude those with missing chi + filter(anon_chi != "") + +# flag episodes in Geriatric Medicine specialty AB +smr_1920 <- smr_1920 %>% + mutate(ger_med = if_else(spec == "AB", 1, 0)) + +# select only those from non-elective stays +smr_1920_ne <- smr_1920 %>% + filter(cij_pattype == "Non-Elective") + +# aggregate to cij level +# we want to keep earliest admission and latest discharge, keep flag if any episode was in spec AB +# take hscp from the last record and sum beddays & cost +cij_1920 <- smr_1920_ne %>% + arrange(anon_chi, cij_marker, record_keydate1, record_keydate2) %>% + group_by(anon_chi, cij_marker) %>% + summarise( + across(record_keydate1, min), + across(c(record_keydate2, ger_med), max), + across(c(cij_pattype, hscp2019), last), + across(c(yearstay, cost_total_net), sum) + ) %>% + ungroup() + +# select only admissions with part of their stay in Geriatric Medicine specialty +cij_ger_med <- cij_1920 %>% + filter(ger_med == 1) + +# aggregate up to patient level +# sum up the admission flag (ger_med), beddays & cost for each patient +# grouping by anon_chi and hscp2019 +pat_1920 <- cij_ger_med %>% + group_by(anon_chi, hscp2019) %>% + summarise( + across(c(ger_med, yearstay, cost_total_net), sum) + ) %>% + ungroup() + +# produce output +# note patients may be counted in more than one hscp +output_table_9 <- pat_1920 %>% + # match on hscp names + left_join(hscp_lookup) %>% + # group up to hscp level + group_by(hscp2019, hscp_desc) %>% + # sum up measures + summarise( + admissions = sum(ger_med), + beddays = sum(yearstay), + cost = sum(cost_total_net), + patients = n() + ) %>% + ungroup() +``` + diff --git a/vignettes/using-arrow-table.Rmd b/vignettes/using-arrow-table.Rmd new file mode 100644 index 0000000..6e9d073 --- /dev/null +++ b/vignettes/using-arrow-table.Rmd @@ -0,0 +1,80 @@ +--- +title: "Using Parquet files with the arrow package" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{using-arrow-table} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" ) ``` + +## Using Parquet files with the arrow package + +The SLFs are available in parquet format. The {arrow} package provides some extra features which can speed up reading and reduce memory usage even further. You can read only specific columns, e.g. `read_parquet(file, col_select = c(var1, var2))`. + +Using arrow's 'Arrow Table' feature, you can speed up analysis considerably. To do this, specify `as_data_frame = FALSE` when using SLFhelper and use `dplyr::collect()` to collect the result as a data frame. + +#### For example: + +Imagine a scenario of analysing planned and unplanned beddays in Scotland. There are two ways to read the episode files and carry out the analysis, by setting `as_data_frame` to `TRUE` or `FALSE`, as follows. 
+```{r arrow, eval=FALSE, message=FALSE} +library(slfhelper) + +## FAST METHOD +# Filter for year of interest +slf_extract1 <- read_slf_episode(c("1819", "1920"), + # Select recids of interest + recids = c("01B", "GLS", "04B"), + # Select columns + col_select = c( + "year", "anon_chi", "recid", + "yearstay", "age", "cij_pattype" + ), + # return an arrow table + as_data_frame = FALSE +) %>% + # Filter for non-elective and elective episodes + dplyr::filter(cij_pattype == "Non-Elective" | cij_pattype == "Elective") %>% + # Group by year and cij_pattype for analysis + dplyr::group_by(year, cij_pattype) %>% + # summarise bedday totals + dplyr::summarise(beddays = sum(yearstay)) %>% + # collect the result into a data frame + dplyr::collect() + +## SLOW and DEFAULT METHOD +# Filter for year of interest +slf_extract2 <- read_slf_episode(c("1819", "1920"), + # Select recids of interest + recids = c("01B", "GLS", "04B"), + # Select columns + col_select = c( + "year", "anon_chi", "recid", + "yearstay", "age", "cij_pattype" + ), + # return a data frame + as_data_frame = TRUE # which is the default +) %>% + # Filter for non-elective and elective episodes + dplyr::filter(cij_pattype == "Non-Elective" | cij_pattype == "Elective") %>% + # Group by year and cij_pattype for analysis + dplyr::group_by(year, cij_pattype) %>% + # summarise bedday totals + dplyr::summarise(beddays = sum(yearstay)) +``` + +By specifying `as_data_frame = FALSE` when using the SLF reading functions, one gains the full advantages of `parquet` files. One of the advantages is fast query processing by reading only the necessary columns rather than entire rows. The table below demonstrates the huge impact of those advantages. + +| | Time consumption (seconds) | Memory usage (MiB) | |-------------------------|:--------------------------:|:------------------:| | `as_data_frame = TRUE` | 4.46 | 553 | | `as_data_frame = FALSE` | 1.82 | 0.43 | + +: Comparison of different ways of reading SLF files diff --git a/vignettes/variable-packs.Rmd b/vignettes/variable-packs.Rmd index 61e0072..78b0033 100644 --- a/vignettes/variable-packs.Rmd +++ b/vignettes/variable-packs.Rmd @@ -16,10 +16,11 @@ knitr::opts_chunk$set( ## Selecting only specified variables -It is recommended to only choose the variables you need when reading in a Source Linkage File. This can be achieved by specifying a `column` argument to the relevant `read_slf_` function. +It is recommended to only choose the variables you need when reading in a Source Linkage File. This can be achieved by specifying a `col_select` argument to the relevant `read_slf_` function. This will result in the data being read in much faster as well as being easy to work with. The full episode and individual files have 200+ and 100+ variables respectively! + ```{r load-package, include=FALSE} library(slfhelper) ``` @@ -27,9 +28,27 @@ library(slfhelper) ```{r column-example, eval=FALSE} library(slfhelper) -ep_data <- read_slf_episode(year = 1920, columns = c("year", "anon_chi", "recid")) +ep_data <- read_slf_episode(year = 1920, col_select = c("year", "anon_chi", "recid")) + +indiv_data <- read_slf_individual(year = 1920, col_select = c("year", "anon_chi", "nsu")) +``` -indiv_data <- read_slf_individual(year = 1920, columns = c("year", "anon_chi", "nsu")) +## Selecting variables using `tidyselect` functions +You can now use `tidyselect` functions, such as `contains()` and `starts_with()`, to select variables in the relevant `read_slf_` function. 
One can also mix `tidyselect` functions with specified variables when selecting. + +```{r tidyselect, eval=FALSE} +library(slfhelper) +ep_data <- + read_slf_episode( + year = 1920, + col_select = !tidyselect::contains("keytime") + ) + +indiv_data <- + read_slf_individual( + year = 1920, + col_select = c("year", "anon_chi", "nsu", tidyselect::starts_with("sds")) + ) ``` ## Looking up variable names @@ -85,7 +104,7 @@ For example to take some demographic data and LTC flags from the individual file ```{r use-ltc-indiv, eval=FALSE} library(slfhelper) -indiv_ltc_data <- read_slf_individual(year = 1920, columns = c("year", demog_vars, ltc_vars)) +indiv_ltc_data <- read_slf_individual(year = 1920, col_select = c("year", demog_vars, ltc_vars)) ``` @@ -95,7 +114,7 @@ library(slfhelper) acute_beddays <- read_slf_episode( year = 1920, - columns = c("year", "anon_chi", "hbtreatcode", "recid", ep_file_bedday_vars, "cij_pattype"), + col_select = c("year", "anon_chi", "hbtreatcode", "recid", ep_file_bedday_vars, "cij_pattype"), recid = c("01B", "GLS") ) ``` From c40144c429bd467264497a8cb2ab9c6ed99d6afe Mon Sep 17 00:00:00 2001 From: Zihao Li Date: Mon, 19 Aug 2024 12:10:38 +0100 Subject: [PATCH 2/5] Increment version number to 0.10.3 (#91) --- DESCRIPTION | 2 +- NEWS.md | 2 ++ 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index 1123852..1ec9689 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,7 +1,7 @@ Type: Package Package: slfhelper Title: Useful functions for working with the Source Linkage Files -Version: 0.10.2 +Version: 0.10.3 Authors@R: c( person("Public Health Scotland", , , "phs.source@phs.scot", role = "cph"), person("James", "McMahon", , "james.mcmahon@phs.scot", role = c("aut"), diff --git a/NEWS.md b/NEWS.md index 5677e65..912e85f 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,3 +1,5 @@ +# slfhelper 0.10.3 + # slfhelper 0.10.2 # slfhelper 0.10.1 From 722d604e09d08c4435851e54f7c8726559ede82f Mon Sep 17 00:00:00 2001 From: Jennit07 <67372904+Jennit07@users.noreply.github.com> Date: Mon, 19 Aug 2024 14:09:19 +0100 Subject: [PATCH 3/5] Update README.md (#92) --- README.md | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/README.md b/README.md index 2887f26..69622eb 100644 --- a/README.md +++ b/README.md @@ -41,6 +41,23 @@ install.packages("pak") pak::pak("Public-Health-Scotland/slfhelper") ``` +## Additional Documentation + +### SLFhelper Articles + +We have created some additional documentation to help with using +SLFhelper. The new documentation can be found here: https://public-health-scotland.github.io/slfhelper/ +by selecting the drop down for Articles. + +This includes: +* Memory usage when using SLFs in Posit +* Using the arrow package with SLFs +* Example questions which have been answered using the SLFs (full code available) + +If you find this useful and would like to see other topics included in +future documentation, please get in touch with the team. + + ## Usage ### Read a file From c25c6954b1ae0d4b93354782ecdfd887d621a7a1 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 9 Sep 2024 10:19:41 +0100 Subject: [PATCH 4/5] Bump JamesIves/github-pages-deploy-action from 4.6.3 to 4.6.4 (#96) Bumps [JamesIves/github-pages-deploy-action](https://github.com/jamesives/github-pages-deploy-action) from 4.6.3 to 4.6.4. 
- [Release notes](https://github.com/jamesives/github-pages-deploy-action/releases) - [Commits](https://github.com/jamesives/github-pages-deploy-action/compare/v4.6.3...v4.6.4) --- updated-dependencies: - dependency-name: JamesIves/github-pages-deploy-action dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- .github/workflows/pkgdown.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/pkgdown.yaml b/.github/workflows/pkgdown.yaml index ef1fb74..2aa7a41 100644 --- a/.github/workflows/pkgdown.yaml +++ b/.github/workflows/pkgdown.yaml @@ -41,7 +41,7 @@ jobs: - name: Deploy to GitHub pages 🚀 if: github.event_name != 'pull_request' - uses: JamesIves/github-pages-deploy-action@v4.6.3 + uses: JamesIves/github-pages-deploy-action@v4.6.4 with: clean: false branch: gh-pages From 995634caba717819632c75014bf7716a13ef24fa Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 9 Sep 2024 09:39:15 +0000 Subject: [PATCH 5/5] Bump peter-evans/create-pull-request from 6 to 7 (#97) Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 6 to 7. - [Release notes](https://github.com/peter-evans/create-pull-request/releases) - [Commits](https://github.com/peter-evans/create-pull-request/compare/v6...v7) --- updated-dependencies: - dependency-name: peter-evans/create-pull-request dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jennit07 <67372904+Jennit07@users.noreply.github.com> --- .github/workflows/document.yaml | 2 +- .github/workflows/render-README.yaml | 2 +- .github/workflows/style.yaml | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/.github/workflows/document.yaml b/.github/workflows/document.yaml index 9280054..c7b3ef2 100644 --- a/.github/workflows/document.yaml +++ b/.github/workflows/document.yaml @@ -34,7 +34,7 @@ jobs: - name: Commit and create a Pull Request on development if: ${{ github.ref == 'refs/heads/development' }} - uses: peter-evans/create-pull-request@v6 + uses: peter-evans/create-pull-request@v7 with: commit-message: "Update documentation" branch: document_development diff --git a/.github/workflows/render-README.yaml b/.github/workflows/render-README.yaml index 3ed5814..21a6eca 100644 --- a/.github/workflows/render-README.yaml +++ b/.github/workflows/render-README.yaml @@ -35,7 +35,7 @@ jobs: - name: Commit and create a Pull Request on production if: ${{ github.ref == 'refs/heads/production' }} - uses: peter-evans/create-pull-request@v6 + uses: peter-evans/create-pull-request@v7 with: commit-message: "Render `README.md` after changes to the `.Rmd` version" branch: render_readme diff --git a/.github/workflows/style.yaml b/.github/workflows/style.yaml index 9a2f67b..d4ff9a4 100644 --- a/.github/workflows/style.yaml +++ b/.github/workflows/style.yaml @@ -60,7 +60,7 @@ jobs: - name: Commit and create a Pull Request on development if: ${{ github.ref == 'refs/heads/development' }} - uses: peter-evans/create-pull-request@v6 + uses: peter-evans/create-pull-request@v7 with: commit-message: "Style package" branch: document_development