The goal of DHSCdatatools is to provide a suite of tools for using data hosted on the DHSC analytical cloud (DAC) platform.
You can install DHSCdatatools from GitHub with:
if (!requireNamespace("librarian")) install.packages("librarian")
librarian::stock(DataS-DHSC/DHSCdatatools)
To reinstall DHSCdatatools use:
librarian::stock(DataS-DHSC/DHSCdatatools, update_all = TRUE)
Once installed you will also need to create/update the .Renviron file
in the root folder of your RStudio project to add the DAC_TENANT
and
KEY_VAULT_NAME
keys to your R environment. These environment variables
are automatically read by the code and tell it how to find the DAC Key
Vault which stores all the configuration values used to connect. To open
the .Renviron file for editing run then following command in the
console from within your project (you may need to install the usethis
package first):
usethis::edit_r_environ("project")
This will open your projects .Renviron file in your source pane. Next navigate to the DAC channel on the Data Science Teams space and copy the text from the DAC connect: .Renviron settings post into your .Renviron file. Save and close the .Renviron file then run the following command to reload it in the current session (.Renviron files are automatically reloaded when a project is opened):
readRenviron(".Renviron")
When the code is first run, a dialogue box may pop up asking to create a folder to save your authentication credentials - please click yes if this occurs.
If you are having issues with your authentication try passing disabling
the authentication cache with: dac_odbc_connect(use_cache = FALSE)
Never commit your .Renviron to git or upload to GitHub. To ensure git
ignores the file add it to the project’s .gitignore using the following
command from the console (you may need to install the usethis
package
first):
usethis::use_git_ignore(".Renviron")
If you do accidentally commit your .Renviron file (or any other sensitive data) please get in touch with the Data Science Hub to discuss how best to mitigate the breach.
To connect to the SQL endpoint you will also need to install the Simba Spark ODBC drivers (32-bit and 64-bit) via the IT service portal. These drivers allow the code to interact with the DAC as if it were a database.
To connect to the DAC SQL endpoint and access data from, for example, the samples.default.mtcars table use:
library(DHSCdatatools)
library(tidyverse)
library(dbplyr)
library(DBI)
# re-read in environment variables
readRenviron(".Renviron")
# Get connection to the DAC
con <- dac_odbc_connect(connections_pane = TRUE)
# Table to query - path normally given in the form "<catalog>.<schema>.<table>"
table_path <- in_catalog("samples", "default", "mtcars")
# Query table and collect results from server
# can use any dplyr verbs supported by odbc driver
df <- tbl(con, table_path) |>
group_by(cyl) |>
summarise(mpg = mean(mpg, na.rm = TRUE)) |>
arrange(desc(mpg))
# Queries are lazily evaluated (meaning nothing is run until needed and as much
# processing is done on the remote server as possible).
print(df)
df |> collect() |> view()
# it is also possible to run SQL directly
df <- dbGetQuery(con, "SELECT * FROM samples.default.mtcars")
# close the connection if no longer used
dbDisconnect(con)
Please note that the DHSCdatatools project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.
All other content is © Crown copyright and available under the terms of the Open Government 3.0 licence, except where otherwise stated.