The package provides base functionality to push R
objects to servers
using the DataSHIELD](https://www.datashield.org/) infrastructure for
distributed computing. Additionally, it is possible to calculate
predictions on the server for a specific model. Combining these allows
to push a model from the local machine to all servers running DataSHIELD
and predicting on that model with data exclusively hold by the server.
The predictions are stored at the server and can be further analysed
using the DataSHIELD functionality for non-disclosive analyses.
At the moment, there is no CRAN version available. Install the development version from GitHub:
remotes::install_github("difuture-lmu/dsPredictBase")
It is necessary to register the assign and aggregate methods in the OPAL
administration. These methods are registered automatically when
publishing the package on OPAL (see
DESCRIPTION
).
Note that the package needs to be installed at both locations, the server and the analysts machine.
library(DSI)
#> Loading required package: progress
#> Loading required package: R6
library(DSOpal)
#> Loading required package: opalr
#> Loading required package: httr
library(dsBaseClient)
library(dsPredictBase)
builder = newDSLoginBuilder()
surl = "https://opal-demo.obiba.org/"
username = "administrator"
password = "password"
builder$append(
server = "ds1",
url = surl,
user = username,
password = password,
table = "CNSIM.CNSIM1"
)
builder$append(
server = "ds2",
url = surl,
user = username,
password = password,
table = "CNSIM.CNSIM2"
)
connections = datashield.login(logins = builder$build(), assign = TRUE)
#>
#> Logging into the collaborating servers
#>
#> No variables have been specified.
#> All the variables in the table
#> (the whole dataset) will be assigned to R!
#>
#> Assigning table data...
### Get available tables:
datashield.symbols(connections)
#> $ds1
#> [1] "D"
#>
#> $ds2
#> [1] "D"
# Model was fitted on the CNSIM data provided by DataSHIELD. The
# response variable is if a patient have had diabetes or not.
load("inst/extdata/mod.Rda")
summary(mod)
#>
#> Call:
#> glm(formula = DIS_DIAB ~ LAB_TSC + LAB_TRIG + LAB_HDL + LAB_GLUC_ADJUSTED +
#> GENDER + DIS_CVA + MEDI_LPD + DIS_AMI, family = binomial(),
#> data = local_data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.4261 -0.1585 -0.1203 -0.0902 3.6771
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -6.90668 1.23102 -5.611 2.02e-08 ***
#> LAB_TSC -0.08805 0.12658 -0.696 0.4867
#> LAB_TRIG 0.18967 0.10105 1.877 0.0605 .
#> LAB_HDL -0.24500 0.35656 -0.687 0.4920
#> LAB_GLUC_ADJUSTED 0.45802 0.06535 7.009 2.41e-12 ***
#> GENDER1 -0.56792 0.32419 -1.752 0.0798 .
#> DIS_CVA1 -9.81495 1455.39758 -0.007 0.9946
#> MEDI_LPD1 2.12107 0.46595 4.552 5.31e-06 ***
#> DIS_AMI1 -12.73821 652.64901 -0.020 0.9844
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 566.14 on 4095 degrees of freedom
#> Residual deviance: 476.68 on 4087 degrees of freedom
#> AIC: 494.68
#>
#> Number of Fisher Scoring iterations: 14
Upload model to DataSHIELD server:
pushObject(connections, mod)
#> [2022-04-21 12:51:33] Your object is bigger than 1 MB (2.4 MB). Uploading larger objects may take some time.
# Check if model "mod" is now available:
DSI::datashield.symbols(connections)
#> $ds1
#> [1] "D" "mod"
#>
#> $ds2
#> [1] "D" "mod"
# Check class of uploaded "mod":
ds.class("mod")
#> $ds1
#> [1] "glm" "lm"
#>
#> $ds2
#> [1] "glm" "lm"
Now predict on uploaded model and data set “D” and store as object “pred”:
predictModel(connections, mod, "pred", "D")
# Check if prediction "pred" is now available:
datashield.symbols(connections)
#> $ds1
#> [1] "D" "mod" "pred"
#>
#> $ds2
#> [1] "D" "mod" "pred"
# Summary of "pred":
ds.summary("pred")
#> $ds1
#> $ds1$class
#> [1] "numeric"
#>
#> $ds1$length
#> [1] 2163
#>
#> $ds1$`quantiles & mean`
#> 5% 10% 25% 50% 75% 90% 95% Mean
#> -6.219511 -5.933623 -5.451908 -4.892368 -4.330816 -3.828193 -3.484391 -4.871689
#>
#>
#> $ds2
#> $ds2$class
#> [1] "numeric"
#>
#> $ds2$length
#> [1] 3088
#>
#> $ds2$`quantiles & mean`
#> 5% 10% 25% 50% 75% 90% 95% Mean
#> -6.241525 -5.940107 -5.476556 -4.904900 -4.336034 -3.839188 -3.426842 -4.879383
Now do the same but assign the values using response type “response”:
predictModel(connections, mod, "pred", "D", predict_fun = "predict(mod, newdata = D, type = 'response')")
ds.summary("pred")
#> $ds1
#> $ds1$class
#> [1] "numeric"
#>
#> $ds1$length
#> [1] 2163
#>
#> $ds1$`quantiles & mean`
#> 5% 10% 25% 50% 75% 90%
#> 0.001986267 0.002641871 0.004269807 0.007447750 0.012985956 0.021285935
#> 95% Mean
#> 0.029759964 0.012757105
#>
#>
#> $ds2
#> $ds2$class
#> [1] "numeric"
#>
#> $ds2$length
#> [1] 3088
#>
#> $ds2$`quantiles & mean`
#> 5% 10% 25% 50% 75% 90%
#> 0.001943102 0.002624839 0.004166283 0.007355694 0.012919244 0.021058086
#> 95% Mean
#> 0.031467146 0.013243564
datashield.logout(connections)
Build by root (machine 20.6.0) on 2022-04-21 12:52:07.
This readme is built automatically after each push to the repository.
Hence, it also is a test if the functionality of the package works also
on the DataSHIELD servers. We also test these functionality in
tests/testthat/test_on_active_server.R
. The system information of the
local and remote servers are as followed:
- Local machine:
R
version: R version 4.1.3 (2022-03-10)- Version of DataSHELD client packages:
Package | Version |
---|---|
DSI | 1.4.0 |
DSOpal | 1.3.1 |
dsBaseClient | 6.2.0 |
dsPredictBase | 0.0.1 |
- Remote DataSHIELD machines:
R
version of ds1: R version 4.1.3 (2022-03-10)R
version of ds2: R version 4.1.3 (2022-03-10)- Version of server packages:
Package | ds1: Version | ds2: Version |
---|---|---|
dsBase | 6.2.0 | 6.2.0 |
resourcer | 1.2.0 | 1.2.0 |
dsPredictBase | 0.0.1 | 0.0.1 |